Preface


The material in this book is based on my several years’ experience in 
construction and evaluation of examinations, first as a member of the 
Board of Examinations of the University of Chicago, later as director 
of a war research project developing aptitude and achievement tests 
for the Bureau of Naval Personnel, and at present as research adviser 
for the Educational Testing Service. Collection and presentation of the 
material have been furthered for me by teaching courses in statistics 
and test theory at The University of Chicago and now at Princeton 
University. 

During this time I have become aware of the necessity for a firm 
grounding in test theory for work in test development. When this book 
was begun the material on test theory was available in numerous articles 
scattered through the literature and in books written some time ago, and 
therefore not presenting recent developments. It seemed desirable to 
me to bring the technical developments in test theory of the last fifty 
years together in one readily available source. 

Although this book is written primarily for those working in test 
development, it is interesting to note that the techniques presented here 
are applicable in many fields other than test construction. Many of 
the difficulties that have been encountered and solved in the testing 
field also confront workers in other areas, such as measurement of atti- 
tudes or opinions, appraisal of personality, and clinical diagnosis. For 
example, in each of these fields the error of measurement is large com- 
pared to the differences that the scientist is seeking; hence the methods 
of dealing with and reducing error of measurement developed in con- 
nection with testing are pertinent. Methods of adjusting results to 
take account of differences in group variability have been developed in 
testing, and they are helpful in arriving at appropriate conclusions when- 
ever the apparent results of an experiment are affected by group varia- 
bility. If measurements in any field are to merit confidence, the scien- 
tist must demonstrate that they are repeatable. Thus the theoretical 
and experimental work on reliability as developed for tests may be 
utilized in numerous other areas, as, for example, clinical diagnosis or
personality appraisal. In any situation where a single decision or a
diagnosis is made on the basis of several different types of evidence, the
material on weighting methods is applicable. 

As the writing of the book progressed, there were several develop- 
ments that seemed especially interesting to me from a technical point 
of view. The basic formulas of test theory are derived from two different 
sets of assumptions. In Chapter 2 the derivation is based on a definition 
of random error, with true score being simply a difference between 
observed and error score. In Chapter 3 the same formulas are derived 
from a definition of true score, error being the difference between ob- 
served and true scores. The treatment of the effects of test length and 
of group heterogeneity in terms of invariants may enable workers in the 
field of statistical theory to furnish test technicians with appropriate 
statistical criteria for this invariance. The distinction between explicit 
and incidental selection and the development of the theory of incidental 
selection for the multivariate case may facilitate the proper use of correc- 
tions for restriction of range. It was especially interesting to me to 
work on the beginning of a rationale for power tests that are partly 
speeded, as given in Chapter 17. This theory should help in determining 
test time limits and their effect on estimates of reliability. The initia- 
tion of a systematic mathematical theory for item analysis, as indicated 
in Chapter 21, should help in constructing tests that are suited for 
different specific purposes. I hope that the discussion of weighting 
methods presented in Chapter 20 will assist in clarifying problems in 
this field. 

Illustrative computing diagrams are given for those formulas simple 
enough to be changed into a linear form. In general, if we work with 
curved lines on a computing diagram, the labor of computing a large 
number of points for each line is prohibitive for the individual worker. 
If the diagram can be made up of straight lines, it is necessary to compute 
only one or two points for each line, and a graph of the proper size 
and scaled with values appropriate for any particular operation can be 
set up in a few hours. To be used for actual computing, such diagrams 
must be much larger than the illustrative ones given here. A minimum 
size of 8 by 10 inches and a large size of 20 by 30 inches will usually be 
found suitable. I hope that the diagrams given here will illustrate the 
principles of construction so well that any worker who has occasion to 
use any one of these formulas a great number of times can construct 
a larger and more detailed diagram with a scale appropriate to his 
own problem. 

The major part of this book is designed for readers with the following
preparation:


1. A knowledge of elementary algebra, including such topics as the 
binomial expansion, the solution of simultaneous linear equations, and 
the use of tables of logarithms.

2. Some familiarity with analytical geometry, emphasizing primarily 
the equation of the straight line, although some use is made of the 
equations for the circle, ellipse, hyperbola, and parabola. 

3. A knowledge of elementary statistics, including such topics as the 
computation and interpretation of means, standard deviations, corre- 
lations, errors of estimate, and the constants of the equation of the 
regression line. It is assumed that the students know how to make and 
to interpret frequency diagrams of various sorts, including the histo- 
gram, frequency polygon, normal curve, cumulative frequency curve, 
and the correlation scatter diagram. Familiarity with tables of the 
normal curve and with significance tests is also assumed. 


A brief résumé of the major formulas from algebra, analytical geom- 
etry, and statistics that are assumed in this book is given in Appendix A. 
In order to include a more complete coverage of some major topics 
in test theory, certain exceptions were made to the foregoing
requirements.


1. Chapter 5, sections 3 and 4, assumes an elementary knowledge of 
analysis of variance, including a first-order interaction. 

2. Chapter 10, section 4, assumes a knowledge of the least squares 
procedure for fitting a second-degree curve. 

3. Chapter 13 assumes an understanding of the elements of matrix
theory.

4. Chapter 14 assumes the ability to use determinants for those cases 
involving more than three variables. For the case of three tests, the 
formulas of Chapter 14 are written without the use of determinantal
notation.

5. Chapter 20, sections 2 and 4, assumes a knowledge of maxima and
minima in calculus and of the solution of simultaneous linear equations
by determinants.

6. Chapter 20, sections 12 and 13, assumes an understanding of the 
use of the Lagrange multiplier in solving a matrix equation. 

The rest of the material in the book has been written so that the parts 
requiring advanced preparation may be omitted without disturbing the
continuity of the material.

My suggestion is that students with the minimum preparation in
elementary algebra and statistics should omit the following material:
Chapter 5 (sections 3 and 4), Chapter 10 (section 4), Chapter 13, and
Chapter 20 (sections 1 to 4, 12, and 13). If lack of time makes it neces- 
sary to curtail assignments still further, Chapters 3, 7, 14, and 17 may 
be omitted also without disturbing the continuity of the treatment. Ad- 
vanced students whose preparation includes calculus, matrix theory, 
and analysis of variance may not need to study Chapters 2, 6, and 11. 


HAROLD GULLIKSEN
Princeton, New Jersey
August, 1950


Symbols


Although certain chapters required a special set of symbols, generally 
the following notation is used in this book. 


X, Y, Z, or W denotes the gross score or raw score on a test. 

i and j are subscripts designating persons. 

g and h are subscripts designating tests. 

x, y, z, or w denotes deviation scores, the gross score minus the mean. 

N denotes total number of persons in a group. 

n denotes the number of persons in a subgroup. 

K denotes the number of items in a test, or the number of tests in 
a battery. 

k denotes the number of items in a subtest. 

T equals the gross true score. 

t equals the deviation true score, the gross true score minus the mean 
of these scores. 

E equals the gross error score (random error). 

e equals the deviation error score (since the average error is zero,
$E = e$).

M, m, $\bar{X}$, $\bar{Y}$ equal the sample mean.

S, s, X, Y equal the sample standard deviation. 

r and R equal a sample correlation coefficient. 

$\mu$ equals the population mean.

$\sigma$ equals the standard deviation for the population.

$\rho$ equals the correlation for the population.

¢ equals the covariance for the population.


Introduction


It is interesting to note that during the 1890's several attempts were 
made in this country to utilize the new methods of measurement of 
individual differences in order to predict college grades. J. McKeen 
Cattell and his student Clark Wissler tried a large number of psycho- 
logical tests and correlated them with grades in various subjects at 
Columbia University; see Cattell (1890), Cattell and Farrand (1896), 
and Wissler (1901). The correlations between the psychological tests 
and the grades were around zero, the highest correlation being .19. A 
similar attempt by Gilbert (1894), at Yale, produced similarly disap- 
pointing results. 

Scientific confidence in the possibilities of measuring individual dif- 
ferences revived in this country with the introduction of the Binet 
scale and the quantitative techniques developed by Karl Pearson and 
Charles Spearman at the beginning of the twentieth century. Nearly 
all the basic formulas that are particularly useful in test theory are 
found in Spearman’s early papers; see Spearman (1904a), (1904b),
(1907), (1910), and (1913). Since then development of both the theory 
and the practical aspects of aptitude and achievement testing has pro- 
gressed rapidly. Aptitude and achievement tests are widely used in 
education and in industry. 

Since 1900 great progress has been made toward a unified quantita- 
tive theory that describes the behavior of test items and test scores 
under various conditions. This mathematical rationale applicable to 
mental tests should not be confused with statistics. A good foundation 
in elementary statistics and elementary mathematics is a prerequisite 
for work in the theory of mental tests. In addition, as the theory of 
mental tests is developed, the necessity arises for various statistical 
criteria to determine whether or not a given set of test data agrees 
with the theory, within reasonable sampling limits. The theory, how- 
ever, must first be developed without consideration of sampling errors, 
and then the statistical problems in conjunction with sampling can be
considered.

This book deals with the mathematical theory and statistical methods
used in interpreting test results. There are numerous non-quantitative
problems involved in constructing aptitude or achievement tests that
are not considered here. Non-quantitative problems such as choice of
item types or matching the examination to the objectives of a curricu-
lum are discussed in the University of Chicago Manual of Examination
Methods (1937); Englehart (1942); Hawkes, Lindquist, and Mann 
(1936); Hull (1928); Orleans (1937); Ruch (1929); and others. There- 
fore, no attempt is made here to familiarize the student with the various 
psychological and educational tests now available or with the scope of 
the many testing programs. Such material is surveyed in yearbooks by 
Buros (1936), (1937), (1938), (1941), and (1949); Hildreth (1939); Lee 
and Symonds (1934); the National Society for the Study of Education, 
the 17th Yearbook (1918); Ruger (1918); Whipple (1914), (1915); Free- 
man (1939); Mursell (1947); Ross (1947); Goodenough (1949); Cron- 
bach (1949); and other general textbooks listed in the bibliography. 

In constructing tests, analyzing and interpreting the results, there are 
five major types of problems: 


1. Writing and selecting the test items. 

2. Assigning a score to each person. 

3. Determining the accuracy (reliability or error of measurement) of 
the test scores. 

4. Determining the predictive value of the test scores (validity or 
error of estimate). 

5. Comparing the results with those obtained using other tests or 
other groups of subjects. In making these comparisons, it is neces- 
sary to consider the effect of test length and group heterogeneity 
on the various measures of the accuracy and the predictive value 
of the test scores. 


In dealing with any given test these problems would arise chronolog- 
ically in the order in which they are given above. However, the theory 
of the selection of test items depends upon comparing them with some 
test score or scores; therefore it is convenient to consider first the theory 
dealing with the accuracy of these test scores. Similarly the evaluation 
of experimental methods of determining reliability and the discussion 
of practical methods of setting up parallel tests depend upon a theoreti-
cal concept of reliability and of parallel tests. Therefore, instead of
beginning with practical problems of item selection, experimental 
methods of determining reliability, or of setting up parallel tests, we 
shall begin with the theoretical constructs. 

An ideal model will be set up giving the measures of accuracy of test
scores and the theoretical effects of changes in test length and in group
heterogeneity. The theory of these changes will be derived from as- 
sumptions regarding parallel tests and selection procedures, without 
inquiring very closely into the experimental methods that are appro- 
priate for realizing these assumptions. Beginning with Chapter 14, 
various practical problems relating to the construction of parallel tests, 
criteria for parallel tests, experimental methods of determining reliabil- 
ity, etc., will be considered. It is felt that postponing such practical 
considerations until the latter part of the book has the advantage of 
giving the student a firm foundation in theory first. Then on the basis 
of this familiarity with the ideal situation, various practical procedures 
can be evaluated in terms of the closeness with which they approximate 
the theoretically perfect method. To consider practical experimental 
procedures without such a grounding in the theoretical foundation 
leaves these procedures as approximations to something that is not yet 
clearly stated or understood. 

The basic theoretical material on accuracy of test scores is presented 
in Chapters 2 through 5, which deal with the topics of test reliability 
and the error of measurement. The effect of test length upon reliability 
and validity is considered in Chapters 6 through 9, and the effect of 
group heterogeneity on measures of accuracy in Chapters 10 through 13. 
In these chapters we give only a theoretical definition of parallel tests, 
and we define reliability as the correlation between two parallel forms. 
This simplified presentation of the concept of parallel tests and of re- 
liability makes it possible to concentrate on the theory of test reliability 
and test validity before taking up the short-cuts and approximations 
that are frequently used in actual practice. Practical problems of 
criteria for parallel tests are given in Chapter 14, and experimental 
methods of determining reliability when a parallel form is not used are 
considered in Chapters 15 and 16. Methods of scoring, scaling, and 
equating tests are considered in Chapters 18 and 19. Problems dealing 
with batteries of tests are considered in Chapter 20, and problems of
item selection in Chapter 21.


Basic Equations Derived 
from a Definition of Random Error 


1. Introduction 


We shall begin by assuming the conventional objective testing pro- 
cedure in which the person is presented with a number of items to be 
answered. Each answer is scored as correct or incorrect, and a simple 
or a weighted sum of the correct answers is taken as the test score. 
The various procedures for determining which items to use and the 
best weighting methods will be considered later. For the present we 
assume that the numerical score is based on a count, one or more 
points for each correct answer and zero for each incorrect answer, and 
we turn our attention to the determination of the accuracy of this score. 

When psychological measurement is compared with the type of
measurement found in physics, many points of similarity and difference 
are found. One of the very important differences is that the error of 
measurement in most psychological work is very much greater than it 
is in physics. For example, Jackson and Ferguson (1941) resorted to 
specially constructed “rubber rulers” in order to reduce the reliability 
of length measurements to values appreciably below .99. The estimation
of the error in a set of test scores and the differentiation between
“error” and “true” score on a test are central problems in mental
measurement.


2. The basic assumption of test theory 


It is necessary to make some assumption regarding the relationship 
between true scores and error scores. Let us define three basic symbols. 


$X_i$ = the score of the ith person on the test under consideration.
$T_i$ = the true score of the ith person on this test.
$E_i$ = the error component for the same person.


In defining these symbols it is assumed that the gross score has two
components. One of these components (T) represents the actual ability
of the person, a quantity that will be relatively stable from test to test
as long as the tests are measuring the same thing. The other compo-
nent (E) is an error. It is due to the various factors that may cause a
person sometimes to answer correctly an item that he does not know,
and sometimes to answer incorrectly an item that he does know. So
far, it will be observed, there is no proposition subject to any experi-
mental check. We have simply said that there is some number T that
would be the person’s correct score, and that the obtained score (X)
does not necessarily equal T.

It is possible to make many different assumptions regarding the rela-
tionship between the three terms X, T, and E. The one made in test
theory is the simplest possible assumption, namely, that


(1) $X_i = T_i + E_i$ or $E_i = X_i - T_i$.


This equation may be regarded as an assumption that states the rela- 
tionship between true and error score; or it may be regarded as an 
equation defining what we are going to mean by error. In other words, 
once we accept the concept of a true score existing that differs from the 
observed score, we may then say that the difference between these two
scores is going to be called error.


3. The problem of determining characteristics of true and error score

It may be noted that so far we have but one equation with two un- 
knowns (T and E). It cannot be solved to determine the values of T 
and of E for the person. If we test additional people, the situation does 
not become any more determinate. Each new score brings one new 
equation, like equation (1), and also two new unknowns. However, we 
may note that with measures on many persons we would have three 
frequency distributions—the distribution of X's, of T's, and of E's. Let 
us investigate to see if we can learn something about the characteristics 
or parameters of these frequency distributions. Can we determine or 
make reasonable assumptions about the means, the standard devia- 
tions, or the intercorrelations of these three distributions? 

There are two equivalent approaches to the problem of determining 
the characteristics of the distributions of T and E. 

1. A definition of error score is given, and the true score is regarded 
simply as the difference between the observed score and the error score. 
Intuitively this approach is somewhat unsatisfying, since the main 
attention is concentrated upon the error part, which is to be ignored, 
and the important component (true score) is just what happens to be 
left over. However, the basic equations can be derived quite simply
from this assumption.

2. The other approach is to define the true score and then to let the
difference between observed and true score be called error. This ap- 
proach is probably intuitively more satisfying, since attention is first 
concentrated upon getting a reasonable true score, and the error is a 
remainder. However, this approach results in much more difficult
equations. Since the first approach is the simpler to follow, let us 
consider it next. 


4. Definition of random errors 


In this approach to the problem of determining the means, the stand-
ard deviations, and the intercorrelations of true, observed, and error
scores, we define more carefully just what is meant by error. In dealing 
with errors of measurement, it is necessary to recognize that there are 
two basic types of error. They are termed random or chance errors, on 
the one hand, and constant or systematic errors, on the other. 

If measurements are consistently larger than they should be or are 
consistently smaller than they should be we have what is termed “con- 
stant error.” For example, if a tape measure has stretched with use 
and age, measurements made with it would be smaller than those made 
with an accurate tape, and there would be a systematic negative error. 
The error would be negative because it is customary to measure error 
as "the obtained measure minus the correct measure." The terms 
random, chance, or unsystematic errors on the other hand refer to dis- 
crepancies that are sometimes large and sometimes small, sometimes 
positive and sometimes negative. 
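
The contrast between the two types of error can be sketched numerically. The 2 per cent stretch, the list of lengths, and the normal distribution used for the chance errors are all assumptions made for this illustration; the function names are hypothetical.

```python
import random
import statistics

def constant_errors(correct_lengths, stretch=0.02):
    """Errors from a tape stretched by the fraction `stretch` (assumed value).

    A stretched tape has its marks too far apart, so it reads
    correct / (1 + stretch). Error = obtained measure minus correct measure.
    """
    return [c / (1.0 + stretch) - c for c in correct_lengths]

def chance_errors(n, sd=1.0, seed=0):
    """Random errors drawn from a normal distribution centred on zero."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, sd) for _ in range(n)]

correct = [10.0, 25.0, 50.0, 100.0]
systematic = constant_errors(correct)
chance = chance_errors(10_000)

# Every constant error has the same (negative) sign.
print("constant errors:", [round(e, 3) for e in systematic])
# Chance errors are mixed in sign and average close to zero.
print("mean chance error:", round(statistics.fmean(chance), 3))
print("share of positive chance errors:",
      sum(e > 0 for e in chance) / len(chance))
```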

The basic assumptions of test theory deal with the definition and the 
estimation of chance errors. These “random errors” are the only errors 
that will be explicitly considered in test theory. For many purposes, 
constant errors can be ignored, since the process of establishing test 
norms takes care of constant errors that may appear in the gross score 
on the test. 

Since we are dealing with random errors, it is not unreasonable to 
assume that over a sufficiently large number of cases the average error 
of this type will be zero. We may write this assumption, 


(2) $M_E = 0$,


and note that the larger the number of cases in the distribution, the 
closer will this assumption be approximated. This equation may also 
be regarded, not as an assumption, but as a part of the definition of 
random errors. By random errors we mean errors that average to zero 
over a large number of cases. Stated more exactly, we can say that the
mean error will differ from zero by an amount that will be smaller than
any assigned quantity however small, if the number of cases is sufficiently
large. In actual practice, however, it is customary to assume that equa- 
tion 2 holds exactly for any particular sample that is being considered. 

Turning to a consideration of the relationship between error score 
and true score, we can see that there is no reason to expect positive 
errors to occur oftener with high than with low true scores, and that the 
same holds for negative errors. Likewise there is no reason to expect 
large errors to occur oftener with low than with high true scores. It is 
reasonable to assume that as the number of cases increases the correla-
tion between true and error scores approaches zero. We may write the
equation

(3) $r_{TE} = 0$

and note that it comes closer and closer to being correct as the number 
of cases increases. Like equation 2, equation 3 is not so much an assump- 
tion as a definition. If the errors correlate with true score, they are not 
random errors. In such a case there is a systematic tendency for per- 
sons with high scores (or low scores) to have the larger errors. In prac- 
tice it is assumed in testing work that equation 3 holds for any given set 
of test data. 

The only other equation needed to define random error relates to the 
correlation between error on one test and error on another parallel test. 
As before, we can point out that there is no reason to expect a relation- 
ship, and that if a relationship existed between error scores on one test 
and error scores on a second test, we should have some systematic and 
predictable source of error and not a random error. In other words, by 
definition the correlation between two sets of random errors is zero or 
approaches zero as the number of cases increases. We may use the 
subscripts 1 and 2 to represent any two parallel tests and write 


(4) $r_{E_1 E_2} = 0$.

Again, as before, we note that strictly speaking this assumption is true
only as the number of cases approaches infinity. In practice, however,
it is assumed to hold for any given set of test data.

We may summarize the foregoing material in the following three
definitions of random error:

The mean error is zero (equation 2). The correlation between
error score and true score is zero (equation 3). The
correlation between errors on one test and those on another
parallel test is zero (equation 4).

We should finally note again that actually these definitions do not hold
unless the number of cases is very large, but that in practice it is cus-
tomary to assume that they hold for any given set of test data.

8 The Theory of Mental Tests [Chap. 2
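
The three definitions of random error can be illustrated with a small simulation. This is only a sketch under stated assumptions: true scores and errors are drawn independently from normal distributions with arbitrarily chosen means and standard deviations, and `check_definitions` is a hypothetical helper. The three quantities are typically closer to zero in the larger samples, though the approach is not strictly steady from one sample size to the next.

```python
import math
import random

def mean(values):
    return sum(values) / len(values)

def sd(values):
    """Standard deviation with N in the denominator, as in the text."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

def corr(a, b):
    """Product-moment correlation between two equally long lists."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)
    return cov / (sd(a) * sd(b))

def check_definitions(n, seed=1950):
    """Return (M_E, r_TE, r_E1E2) for n simulated persons."""
    rng = random.Random(seed)
    true = [rng.gauss(50.0, 10.0) for _ in range(n)]   # assumed true scores
    err1 = [rng.gauss(0.0, 5.0) for _ in range(n)]     # errors on test 1
    err2 = [rng.gauss(0.0, 5.0) for _ in range(n)]     # errors on test 2
    return mean(err1), corr(true, err1), corr(err1, err2)

for n in (100, 10_000, 100_000):
    m_e, r_te, r_ee = check_definitions(n)
    print(f"N={n:>7}  M_E={m_e:+.3f}  r_TE={r_te:+.3f}  r_E1E2={r_ee:+.3f}")
```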


5. Determination of mean true score 


In order to determine the mean true score we note from the definition
of equation 1 that


(5) $T_i = X_i - E_i$.

Summing both sides gives

(6) $\sum_{i=1}^{N} T_i = \sum_{i=1}^{N} (X_i - E_i)$.


Removing the parentheses and omitting the subscripts and limits (since 
they are all identical), we have 


(7) $\sum T = \sum X - \sum E$.

Dividing by N (the number of cases) to obtain the mean gives

(8) $M_T = M_X - M_E$.

Using equation 2, we can see that

(9) $M_T = M_X$.


The mean true score equals the mean observed score. Equa-
tion 9 is based only on the definition of equation 1 and the
assumption of equation 2.
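
A short numerical check may help to separate equation 8, which holds for any set of scores, from equation 9, which needs a mean error of zero. The scores are invented for the illustration and `observed_from` is a hypothetical helper.

```python
from statistics import fmean

true = [12, 15, 9, 20]          # invented true scores T_i

def observed_from(true, errors):
    """Equation 1: X_i = T_i + E_i."""
    return [t + e for t, e in zip(true, errors)]

# Case 1: errors that average exactly zero (equation 2 holds).
errors_zero_mean = [1, -2, 2, -1]
x1 = observed_from(true, errors_zero_mean)

# Case 2: errors with a mean of 1 (a constant error has crept in).
errors_biased = [2, -1, 3, 0]
x2 = observed_from(true, errors_biased)

for errors, observed in ((errors_zero_mean, x1), (errors_biased, x2)):
    m_t, m_x, m_e = fmean(true), fmean(observed), fmean(errors)
    # Equation 8 holds in both cases; equation 9 only when M_E = 0.
    print(f"M_T={m_t}  M_X={m_x}  M_E={m_e}  M_X - M_E={m_x - m_e}")
```

In the first case the mean observed score equals the mean true score (both 14). In the second the observed mean is 15, and the true mean is recovered only after subtracting the mean error.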


6. Relationship between true and error variance 

Next let us determine the relationship between the standard devia- 
tions of the true, the error, and the observed scores. From the defini- 
tion of equation 1 and from equation 9 we may write 


(10) $X - M_X = T + E - M_T$.

Let us use lower-case letters to represent deviation scores. That is,

(11) $x = X - M_X$,

(12) $t = T - M_T$,

and since $M_E$ equals zero from the definition of equation 2, we have

(13) $e = E$.

Substituting equations 11, 12, and 13 in equation 10 gives

(14) $x = t + e$.

Squaring and summing gives

(15) $\sum x^2 = \sum (t + e)^2$.

Removing the parentheses gives

$\sum x^2 = \sum t^2 + \sum e^2 + 2 \sum te$.

Dividing both sides by N, we have

(16) $s_x^2 = s_t^2 + s_e^2 + 2 r_{te} s_t s_e$.

Substituting the definition of equation 3 in equation 16 gives

(17) $s_x^2 = s_t^2 + s_e^2$.


The variance of the observed scores is equal to the sum of the
true variance and error variance. It should be noted that
equation 17 may be derived solely from the definition of equa-
tion 1 and the assumption of equation 3.
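
The role of equation 3 in this result can be checked on two small invented data sets. The sketch below computes each term of equation 16 with N in the denominator; the helper names are hypothetical, and the cross term $2 r_{te} s_t s_e$ is computed as twice the covariance of t and e, which is the same quantity.

```python
import math

def mean(values):
    return sum(values) / len(values)

def variance(values):
    """Variance with N in the denominator, as in the text."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def cross_term(true, errors):
    """2 * r_te * s_t * s_e, computed as twice the covariance of t and e."""
    mt, me = mean(true), mean(errors)
    return 2 * sum((t - mt) * (e - me) for t, e in zip(true, errors)) / len(true)

def decompose(true, errors):
    """Return (s_x^2, s_t^2, s_e^2, cross term) for X = T + E."""
    observed = [t + e for t, e in zip(true, errors)]
    return (variance(observed), variance(true),
            variance(errors), cross_term(true, errors))

# Errors uncorrelated with true score: equation 17 holds.
print(decompose([10, 10, 20, 20], [1, -1, 1, -1]))
# Errors correlated with true score: only equation 16 holds.
print(decompose([12, 15, 9, 20], [1, -2, 2, -1]))
```

The first call prints `(26.0, 25.0, 1.0, 0.0)`, so the observed variance is exactly the sum of the true and error variances. The second prints `(9.0, 16.5, 2.5, -10.0)`: the errors there correlate negatively with true score, and the observed variance falls short of 16.5 + 2.5 by the cross term.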


The relationship between these three variances is shown in Figure 1. 
Such a diagram can readily be constructed to cover any particular set
of values for the three variances. The true variance and the error
variance are indicated, one on the abscissa, the other on the ordinate.
The set of diagonal lines indicates points such that the sum of the true
and the error variance is constant; hence each diagonal line can be
marked with the appropriate value of $s_x^2$.

Figure 1. Computing diagram for observed, true, and error variance ($s_x^2 = s_t^2 + s_e^2$).

Figure 2. Computing diagram for observed, true, and error standard deviations ($s_x = \sqrt{s_t^2 + s_e^2}$).


The computing diagram of Figure 1 is the basic one for addition of 
two quantities. The x and y scales should be set up to cover the appro- 
priate range of values. If the two scales are the same, a set of 45-degree 
lines can be marked to indicate the sum of the x and y values. If it is 
not feasible to use the same units on the x and y scales, the slope of the 
"sum" lines must be different from 45 degrees. 

The error variance ($s_e^2$) or the error of measurement ($s_e$) is a funda-
mental and important characteristic of any test. If the error of meas-
urement can be reduced, the test has been improved. If any factors
operate to increase the error of measurement, the test has been made
poorer. Much of the effort in test construction, test revision, test
analysis, and the precautions of test administration are for the purpose
of decreasing the value of $s_e$.

Taking the square root of both sides of equation 17 gives 


(18) $s_x = \sqrt{s_t^2 + s_e^2}$.


This is the familiar Pythagorean theorem: “The hypotenuse of a right
triangle is equal to the square root of the sum of the squares of the two
sides.” $s_t$ and $s_e$ could be diagramed as the sides of a right triangle, and
$s_x$ would be the hypotenuse.

This relationship can be utilized to construct a simple computing
diagram for the foregoing equation. Draw a series of concentric quarter
circles, as illustrated in Figure 2, including the range of values of $s_t$ and
$s_e$ likely to be found in the data being considered. Find $s_t$ and $s_e$ on
the vertical and horizontal axes. From the point of intersection follow
the circle to either axis and read off $s_x$. For example, the dotted lines
show that if $s_t = 4$ and $s_e = 3$, then $s_x = 5$.
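
The reading taken from the quarter-circle diagram can also be computed directly. The following is a minimal sketch of equation 18 and of its rearrangement for the error of measurement; the function names are hypothetical.

```python
import math

def observed_sd(s_t, s_e):
    """Equation 18: s_x = sqrt(s_t^2 + s_e^2)."""
    return math.hypot(s_t, s_e)

def error_sd(s_x, s_t):
    """Equation 17 solved for s_e; requires s_x >= s_t."""
    return math.sqrt(s_x ** 2 - s_t ** 2)

print(observed_sd(4, 3))   # 5.0
print(error_sd(5, 4))      # 3.0
```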


7. Definition of parallel tests in terms of true score and error variance

From a common sense point of view, it may be said that two tests
are “parallel” when “it makes no difference which test you use.” It is
certainly clear that, if for some reason one test is better than the other 
for certain purposes, it does make a difference which test is used, and 
the tests could not be termed parallel. However, this simple statement, 
“it makes no difference which test is used," must be cast in mathemati- 
cal form before we can use it in any derivations. 

It would seem to be clear that if the true score of a given person (i) 
on one test is different from his true score on a second test, we cannot 
say that the two tests are parallel. In other words, if we designate the 
person by the subscript i and the two tests by the subscripts g and h,
we can say that two tests are parallel only if


(19) $T_{ig} = T_{ih}$.


The true score of any person on one test must equal the true 
score of that person on the other parallel test. 
Equation 19, however, is not the only requirement for parallel tests.
If the difference between observed score and true score is in general
much greater for one test, it is clearly better to use the other test. And
we cannot say, “It makes no difference which test is used.” Also it is
unreasonable, in the light of what has been said previously, to require
that $E_{ig} = E_{ih}$. This statement would contradict the definition of
equation 4, which says that the errors on one test correlate zero with
the errors on another test. If for each person (i) the error score on one
test equaled the error score on another test, the correlation between 
errors would be 1.00 and not zero. The closest we can reasonably come 
to defining the errors to be alike on parallel tests is to require that the 
standard deviation of the errors on one test equal the standard devia- 
tion of errors on the other test. Since this can be true when the correla- 
tion is zero, it does not violate any other assumption that has been 
made. Thus we may write the second equation defining parallel tests, 


(20) $s_{e_g} = s_{e_h}$. 

This equation may be stated in the definition: 

For two parallel tests, the errors of measurement are equal. 


Equations 19 and 20 will serve to define parallel tests. 

It will be noted that both equations 19 and 20 are stated in terms of 
hypothetical quantities. In other words, these equations do not pro- 
vide for testing the actual means, standard deviations, or intercorrela- 
tions of a set of tests and for determining whether or not they are paral- 
lel tests. Let us see what can be determined from equations 19 and 20 
regarding the parameters of the observed scores. 

If all true scores are identical, it follows that the means and standard 
deviations of true scores are also identical for parallel tests. We may 
write 

(21) $M_{T_g} = M_{T_h}$ or $\bar{T}_g = \bar{T}_h$ 

and 

(22) $s_{t_g} = s_{t_h}$. 


The correlation between two sets of identical scores is unity; therefore, 

(23) $r_{T_g T_h} = 1.00$. 

Applying equation 17 to tests g and h, we see that 

(24) $s_{x_g}^2 = s_{t_g}^2 + s_{e_g}^2$, 

and correspondingly for test h 

(25) $s_{x_h}^2 = s_{t_h}^2 + s_{e_h}^2$. 


From equations 20 and 22 we see that 

(26) $s_{x_g} = s_{x_h}$. 

From equations 21 and 9 it follows that 

(27) $M_{X_g} = M_{X_h}$ or $\bar{X}_g = \bar{X}_h$. 

The means and standard deviations of parallel tests are equal.¹ 

We turn next to consideration of the problem of the correlation 
between parallel forms of a test. 


8. Correlation between parallel forms of a test 

For the present we shall define reliability as the correlation between 
two parallel forms of a test. The problem of determining whether or 
not two forms are parallel and of the best method of estimating the 
correlation between the two forms will be considered later. 

The correlation between two parallel forms of a test may readily be 
found by using the deviation score formula for correlation: 

(28) $r_{x_g x_h} = \frac{\sum x_g x_h}{N s_{x_g} s_{x_h}}$. 

From equation 14 we may express the numerator of the right side of 
equation 28 as follows: 

(29) $\sum x_g x_h = \sum (t_g + e_g)(t_h + e_h)$. 

Expanding and removing the parentheses gives 

(30) $\sum x_g x_h = \sum t_g t_h + \sum t_g e_h + \sum t_h e_g + \sum e_g e_h$. 

From the definitions of equations 3 and 4 we see that the last three 
terms in equation 30 are each zero. Since we are dealing with parallel 
tests, the true score on g equals the true score on h. Therefore, equation 
30 may be rewritten as 

(31) $\sum x_g x_h = \sum t_g^2 = \sum t_h^2$. 

We may divide both sides of equation 31 by N, obtaining 

(32) $\frac{\sum x_g x_h}{N} = \frac{\sum t_g^2}{N}$. 


¹ It should be noted that equations 26 and 27 may be interpreted as applying 
either to gross scores or to scores after transformation by one of the methods 
discussed in Chapter 19. However, the set of scores ($X_g$) and the set of scores ($X_h$) are 
not parallel unless the means and standard deviations are about equal. Criteria of 
equality are discussed in Chapter 14. 

From the definition of a standard deviation we see that 

(33) $\frac{\sum t_g^2}{N} = s_t^2$. 

Substituting equations 26 and 33 in equation 28 gives 

(34) $r_{x_g x_h} = \frac{s_t^2}{s_x^2}$. 


We see from the reasoning used in developing equation 26 that, if we 
were dealing with several parallel forms, all the observed variances 
would be alike. That is, if we designate the forms by subscripts 1, 2, 3, 
and 4, 

(35) $s_{x_1} = s_{x_2} = s_{x_3} = s_{x_4}$. 

Also, by the reasoning used in developing equation 22, we see that 

(36) $s_{t_1} = s_{t_2} = s_{t_3} = s_{t_4}$. 

From equations 34, 35, and 36 we see that 

(37) $r_{x_1 x_2} = r_{x_1 x_3} = r_{x_1 x_4} = \cdots = r_{x_3 x_4}$. 

All intercorrelations of parallel tests are equal. 


Equations 26, 27, and 37 show that parallel forms of a test should 
have approximately equal means, equal standard deviations, and equal 
intercorrelations.¹ These are objective quantitative criteria for which 
a statistical test will be presented in Chapter 14. In addition to satisfying 
these objective and quantitative criteria, parallel tests should also 
be similar with respect to test content, item types, instructions to 
students, etc. Similarity in these respects can as yet be determined 
only by the judgment of psychologists and of subject matter experts. 
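
The three quantitative criteria can be illustrated with a small simulation. This is a sketch under stated assumptions, not a procedure from the text: true scores and errors are drawn from normal distributions (the derivations above do not require normality), the true and error standard deviations are set to 4 and 3, and all function names are hypothetical.

```python
import math
import random


def mean(values):
    return sum(values) / len(values)


def sd(values):
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def corr(u, v):
    mu, mv = mean(u), mean(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)
    return cov / (sd(u) * sd(v))


def simulate_parallel_forms(n, s_t, s_e, n_forms=3, seed=0):
    """Build forms sharing one true score per person (equation 19)
    with independent errors of equal spread (equation 20).
    Normal distributions are an assumption made for the simulation."""
    rng = random.Random(seed)
    true = [rng.gauss(50.0, s_t) for _ in range(n)]
    return [[t + rng.gauss(0.0, s_e) for t in true] for _ in range(n_forms)]


forms = simulate_parallel_forms(n=20000, s_t=4.0, s_e=3.0)
means = [mean(f) for f in forms]
sds = [sd(f) for f in forms]
rs = [corr(forms[g], forms[h]) for g in range(3) for h in range(g + 1, 3)]

# Equation 34 predicts r = s_t^2 / s_x^2 = 16 / 25 = .64 for every pair.
print("means:", means)
print("standard deviations:", sds)
print("intercorrelations:", rs)
```

With a sample this large the three means, the three standard deviations, and the three intercorrelations should each agree closely, and the intercorrelations should fall near .64.
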


9. Equation for true variance 
Multiplying equation 34 by $s_x^2$ gives 

(38) $s_t^2 = s_x^2 r_{x_g x_h}$ (true variance). 

Taking the square root of both sides, we have 

(39) $s_t = s_x \sqrt{r_{x_g x_h}}$ (true standard deviation). 


¹ Means and standard deviations may be equal for the raw scores or for scores 
transformed by one of the methods suggested in Chapter 19. For transformed 
scores, it is necessary to determine the transformation equation from one set of data 
and to test for equality of means and standard deviations after the same 
transformation has been applied to a new set of data. 

Equations 38 and 39 give the variance and standard deviation 
of the distribution of true scores in terms of the test reliability 
and the standard deviation of the distribution of observed 
scores. These equations may be derived by assuming 
only equations 1, 8, 20, and 26. 


10. Equation for error variance (the error of measurement) 
We may solve for the error variance by substituting equation 38 in 
equation 17, obtaining 

(40) $s_x^2 = s_x^2 r_{x_g x_h} + s_e^2$. 

Solving equation 40 for the error variance ($s_e^2$) gives 

(41) $s_e^2 = s_x^2 (1 - r_{x_g x_h})$ (variance of the errors of measurement). 

Taking the square root of both sides gives the equation for the standard 
error of measurement, 

(42) $s_e = s_x \sqrt{1 - r_{x_g x_h}}$ (error of measurement). 


Equations 41 and 42 give the variance of the errors of meas- 
urement and the standard deviation of these errors. They 
follow from the assumptions needed to derive equations 38 
and 39. The error of measurement is a fundamental con- 
cept in test theory and an important characteristic of a test. 


It was suggested by Kelley (1921) and Otis (1922b) that the error of 
measurement was an invariant of a test, that is, it did not vary with 
changes in group heterogeneity. The equations given in Chapter 10 
are based on this assumption. 

The computing diagram used before to indicate the relationship 
$s_x = \sqrt{s_t^2 + s_e^2}$ can also be used with some slight complication to 
compute the true standard deviation and error of measurement, given 
the standard deviation and reliability of the test. Figure 3 indicates 
how this computation is done. Draw the series of circles, as before, to 
indicate the relationship between true, error, and observed standard 
deviation. Radial lines can then be drawn to indicate a given reliability. 
Where each of the horizontal lines (indicating $s_t$) intersects the large 
quadrant with radius 10, we have successively the points for which the 
reliability is $.1^2, .2^2, \cdots, .9^2$. Still finer subdivisions can be drawn from 
the general rule that the reliability coefficient is the ratio of the true 
variance to the observed variance. By selecting the quadrant with 
radius 10, the division consists in simply pointing over two places 
(standard deviation 10 gives variance of 100). For each point along 
this radius, the reliability coefficient is $s_t^2/100$. 

In order to use the diagram, find the radius corresponding to the 
reliability coefficient and the circle corresponding to the observed 
standard deviation. Then note the intersection of this particular radius 
and circle. The y-coordinate of that point is the true standard deviation, 
the x-coordinate of that point is the error of measurement for 
that test. 

[Figure 3. The true standard deviation and error of measurement as functions of 
the test reliability and standard deviation. Abac for $s_t = s_x \sqrt{r_{x_g x_h}}$ and 
$s_e = s_x \sqrt{1 - r_{x_g x_h}}$.] 

For example, the point marked with a circle is where the reliability 
is .64, and the standard deviation of observed scores is 7. In this 
case the standard deviation of the true scores is 5.6 ($7 \times \sqrt{.64}$), and 
the error of measurement is the x-coordinate of the same point, 4.2 
($7 \times \sqrt{.36}$). 

If we know the true and error standard deviations, it is possible to 
use the same diagram to read off the reliability and the observed standard 
deviation. In this case, look up the true standard deviation on the 
y-axis, the error of measurement on the x-axis, find the point with these 
two coordinates, and then read the value of the radius through this 
point to find the test reliability, and the value of the circle through this 
point to find the standard deviation of observed scores. 
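
The reading taken from Figure 3 corresponds to equations 39 and 42. A minimal sketch follows; the function name `decompose_sd` and its argument names are assumptions made for illustration only.

```python
import math


def decompose_sd(s_x: float, reliability: float) -> tuple[float, float]:
    """Return (s_t, s_e) from the observed standard deviation and the
    reliability, using s_t = s_x*sqrt(r) and s_e = s_x*sqrt(1 - r).

    `decompose_sd` is a hypothetical name, not one used in the text.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    s_t = s_x * math.sqrt(reliability)
    s_e = s_x * math.sqrt(1.0 - reliability)
    return s_t, s_e


# The circled point in Figure 3: reliability .64, observed SD 7.
s_t, s_e = decompose_sd(7.0, 0.64)
print(round(s_t, 2), round(s_e, 2))  # 5.6 4.2
```
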

In general we may describe this graph by pointing out that it is a 
combination of rectangular and polar coordinate systems. Any point 
in it represents four quantities, one on the x-axis, one on the y-axis (the 
rectangular coordinate system), one on the radius, and one on the 
circle (the polar coordinate system). Each point, which represents 
four quantities, is determined by any two of these quantities. There- 
fore, generalizing, we may say that given any two of the four quantities 
involved, it is possible to determine the other two. 
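
As one instance of "any two determine the other two," the reverse reading (from true and error standard deviations to reliability and observed standard deviation) can be sketched as follows. The function name is hypothetical, and the reliability is taken as the ratio of true variance to observed variance, as stated above.

```python
import math


def reliability_and_observed_sd(s_t: float, s_e: float) -> tuple[float, float]:
    """Return (reliability, s_x) from the true and error standard
    deviations: r = s_t**2 / (s_t**2 + s_e**2), s_x = sqrt(s_t**2 + s_e**2).

    Hypothetical helper written for illustration.
    """
    var_x = s_t ** 2 + s_e ** 2
    return s_t ** 2 / var_x, math.sqrt(var_x)


r, s_x = reliability_and_observed_sd(5.6, 4.2)
print(round(r, 2), round(s_x, 2))  # 0.64 7.0
```
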


11. Use of the error of measurement 

It should be noted that, in the development of the equations of this 
chapter, there is no reference to any particular type of frequency dis- 
tribution. It is assumed that the average error is zero, and that errors 
are uncorrelated with each other or with true score. All the equations 
of this chapter follow from these assumptions, regardless of the fre- 
quency distribution of true scores, error scores, or observed scores. 
However, it is not possible to make use of the error of measurement 
without some assumption about its distribution. This quantity is 
usually assumed to be normally distributed. Figure 4 gives a concrete 
illustration for a particular test in which the error of measurement is 
assumed to be 5 score points. Observed scores are indicated on the 
base line. A is the distribution of observed scores for all persons whose 
true score is 50 points. It will be noted that the mean is at 50, the in- 
flection points at 45 and 55 (50 plus and minus 5), and that all but a 
negligible part of the distribution lies between scores of 35 and 65 (50 
plus and minus three times 5). That is to say, for the group of persons 
whose true score is 50, over 99 per cent of the observed scores will lie 
between 35 and 65. 
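
The "over 99 per cent" figure can be checked under the normality assumption just stated. The sketch below uses the standard normal identity $P(|Z| < c) = \operatorname{erf}(c/\sqrt{2})$; the function names are illustrative assumptions.

```python
import math


def prob_within(c: float) -> float:
    """Probability that a normally distributed error lies within
    c standard errors of zero: erf(c / sqrt(2))."""
    return math.erf(c / math.sqrt(2.0))


def observed_score_range(true_score: float, s_e: float, c: float = 3.0):
    """Range of observed scores within c errors of measurement of a
    given true score (hypothetical helper)."""
    return true_score - c * s_e, true_score + c * s_e


print(observed_score_range(50.0, 5.0))  # (35.0, 65.0)
print(round(prob_within(3.0), 4))       # 0.9973
```
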

We may also indicate a method for assigning a probability value 
to the statement: "If a person's observed score is 65, his true score 
probably lies between 50 and 80." Note that no probability value is 
given. We cannot say that for all persons whose observed score is 
65, the probability is greater than .99 that the true score is between 50 
and 80. However, consider the statement: "This person's true score 
lies between 50 and 80." If the person's true score were known, the 
statement could, for a given person, be classified as true or as false. It 
would be possible to keep on following this rule of procedure. For 
each person (some with one observed score, and some with another) 
we could say: "This person's true score lies between $T_L$ and $T_U$," where 
$T_L$, the lower limit for the true score, is found by subtracting three 
times the error of measurement from the observed score, and $T_U$, the 
upper limit for the true score, is found by adding three times the error 
of measurement to the observed score. For each of the persons whose 
observed score is known, such a statement can be made. In addition 
the statement so made can be labeled true or false. For all the persons 
in the distribution, it will be found that the statement regarding limits 
is true over 99 per cent of the time and is false less than three times out 
of a thousand. In other words, if all the cases are considered, a probability 
can be attached to the truth or falsity of the statement that 
"true score is included within the specified limits." However, if we 
limit the statement to persons with any given observed score, no assertion 
regarding probability can be made. 

[Figure 4. Illustration of a significant difference between two test scores. 
Upper part: A = distribution of observed scores for those with a true score of 50; 
B = distribution of observed scores for those with a true score of 80 
(standard error of measurement is 5.0 for both distributions); horizontal axis, 
observed score. Lower part: true score plotted against observed score.] 

The situation shown in the upper part of the diagram for persons 
with a true score of 50 and of 80 can be generalized if the correlation 
scatter plot is given. The lower part of the diagram shows the scatter 
plot of true score against observed score. The heavy line marks the 
cases in which observed score and true score are equal. Two lighter 
lines are drawn on each side of the heavy one to indicate those cases in 
which the observed score is equal to the true score minus three times 
the error of measurement, and in which the observed score is equal to 
the true score plus three times the error of measurement. For true 
scores of 50 and 80, again we can read the limits within which over 
99 per cent of the observed scores will fall. We can also see that the 
observed score of 65 could reasonably be from any distribution where 
the true score was over 50 or under 80. Likewise, we may pick any 
observed score, such as 60 in the diagram, and see that such a score 
might have arisen from any true score over 45 and under 75. That is 
to say, take three times the error of measurement, or 15; then add 15 
to 60 and subtract 15 from 60, obtaining the upper and lower limits of 
75 and 45. Thus if we know the error of measurement of a test it is 
possible, for any given observed score, to specify limits within which 
the true score lies. We can also say that such a statement is true for 
99.7 per cent of the cases (and false for 0.3 per cent of the cases) in the 
entire distribution. However, no such probability statement can be 
made which applies to the group of persons making any specified ob- 
served score. Distribution B shows the same information for persons 
whose true score is 80. Since it is assumed that the standard error of 
measurement is constant regardless of true score, distribution B also 
has a standard deviation of 5 points. For persons whose true score is 
80, the observed scores, in over 99 per cent of the cases, will lie between 
65 and 95 (80 plus and minus three times 5). It will be noted that if a 
person’s observed score is 65, he might reasonably be either a top- 
scoring person from distribution A, or a very low-scoring person from 
distribution B. In other words, if a person’s observed score is 65, his 
true score could reasonably be as low as 50 (65 minus three times 5), or 
as high as 80 (65 plus three times 5). His true score might also reasonably 
have been any value between 50 and 80. However, if a person's 
true score is below 50, there is considerably less than two chances in a 
thousand that his observed score would be 65 or higher. Conversely, 
if a person’s true score is above 80, there is considerably less than two 
chances in a thousand that his observed score would be 65 or less. We 
say then that for a person whose observed score is 65, reasonable limits 
for his true score are 50 to 80. 

It should be noted that, although we can assign reasonable limits for 
the true score, we cannot say that for all persons who score 65, over 99 
per cent will have true scores between 50 and 80. In general no prob- 
ability statements can be made to apply to all persons who make a 
given observed score. We can only make the statement the other way 
around. For all persons with a given true score, the probability is over 
.997 that the observed score will lie within plus or minus three times 
the error of measurement from that true score. Likewise, for all per- 
sons with a given true score, the probability is less than .003 that the 
observed score will lie outside the range given by the true score plus 
and minus three times the error of measurement. 


For persons with any given observed score $X_i$, reasonable 
limits for the true score $T_i$ may be taken as 

(43) $X_i + c s_e > T_i > X_i - c s_e$, 

where c is taken as equal to 2 or 3 and $s_e = s_x \sqrt{1 - r_{x_g x_h}}$. 
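
Equation 43 translates directly into a small helper. This is a sketch with a hypothetical function name; as the text stresses, the limits are "reasonable limits," not a probability statement about persons with a given observed score.

```python
def true_score_limits(observed: float, s_e: float, c: float = 3.0):
    """Reasonable limits (lower, upper) for the true score of a person
    with the given observed score, per equation 43: X - c*s_e, X + c*s_e.

    `true_score_limits` is an illustrative name, not from the text.
    """
    return observed - c * s_e, observed + c * s_e


# Observed score 65, error of measurement 5, c = 3.
print(true_score_limits(65.0, 5.0))  # (50.0, 80.0)
```
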

The error of measurement may also be utilized to determine reasonable 
limits for the difference between the true scores of two persons. In 
this case we utilize the difference between two scores, $x_i - x_j$, and the 
standard error of that difference $s_{e_i - e_j}$. To write the formula for this 
error, we use equation 14 and write 

(44) $x_i - x_j = t_i - t_j + (e_i - e_j)$. 

The terms in parentheses indicate the error. Thus the variation of the 
observed difference from the true difference is indicated by 

(45) $\sum (e_i - e_j)^2 = \sum e_i^2 + \sum e_j^2 - 2 \sum e_i e_j$. 

From equation 4, we see that the last term of equation 45 is zero. Using 
equation 41 in equation 45, we have 

(46) $\sum (e_i - e_j)^2 = 2 N s_x^2 (1 - r_{x_g x_h})$. 

Dividing by N and taking the square root, to get the standard error of 
the difference $s_{e_i - e_j}$, we have 

(47) $s_{e_i - e_j} = s_x \sqrt{2} \sqrt{1 - r_{x_g x_h}}$. 

Illustrating again, using the distribution of Figure 4, the standard 
error of which is 5, we have the standard error of the difference of two 
scores as $5\sqrt{2}$. Figure 5 illustrates the frequency distribution of observed 
score differences for persons with true score differences of −25, 0, 
and +25. In each case the distribution is shown as extending $3 \times 5\sqrt{2}$ 
above and below the true score difference. 


[Figure 5. Illustrating the standard error of a difference of observed scores. 
Axes: observed score (horizontal, −60 to +60) and true score (vertical).] 


Again it should be noted that in developing the equation for the 
standard error of a difference, no assumptions were made regarding the 
distribution of errors. However, in order to utilize this error to obtain 
reasonable limits for the value of the difference between true scores, 
some assumption regarding the frequency distribution of errors is needed. 
In Figure 5 the usual assumption of a normal distribution of errors is 
made. 

As before, we may generalize for all possible distributions as shown in 
the lower part of Figure 5. It shows that if the observed score difference 
is zero, the true score difference may be as high as $+3 \times 5\sqrt{2}$ or 
as low as $-3 \times 5\sqrt{2}$. If the observed score difference is as large as 22, 
the entire range from $22 + 3 \times 5\sqrt{2}$ to $22 - 3 \times 5\sqrt{2}$ does not 
include zero. Hence it is not reasonable to assume that there is no difference 
between the two true scores represented by these observed 
scores. In such a case it has become customary to say that there is a 
significant difference between the two scores. When the difference 
between two scores is less than $3 \times 5\sqrt{2}$, the range of possible true 
score differences will include zero; and it is conventional to say that 
there is not a significant difference between the two scores. This means 
that zero difference is one of the possibilities. 


For persons with any given observed score difference ($x_i - x_j$) 
reasonable limits for the difference of true scores ($t_i - t_j$) may 
be taken as 

(48) $x_i - x_j + c s_e \sqrt{2} > t_i - t_j > x_i - x_j - c s_e \sqrt{2}$, 

where c is taken as equal to 2 or 3, and $s_e = s_x \sqrt{1 - r_{x_g x_h}}$. 
If these limits include zero, there is no significant difference 
between $x_i$ and $x_j$. If the limits are both positive, or both 
negative, there is a significant difference between $x_i$ and $x_j$. 
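
The rule in equation 48 can be sketched as two small functions. The names `difference_limits` and `is_significant` are hypothetical; the rule itself is the one just stated, with the standard error of the difference taken as $s_e\sqrt{2}$.

```python
import math


def difference_limits(x_i: float, x_j: float, s_e: float, c: float = 3.0):
    """Reasonable limits (lower, upper) for the true score difference,
    per equation 48: (x_i - x_j) -/+ c * s_e * sqrt(2)."""
    half_width = c * s_e * math.sqrt(2.0)
    diff = x_i - x_j
    return diff - half_width, diff + half_width


def is_significant(x_i: float, x_j: float, s_e: float, c: float = 3.0) -> bool:
    """True when the limits are both positive or both negative,
    that is, when they exclude a zero true score difference."""
    lower, upper = difference_limits(x_i, x_j, s_e, c)
    return lower > 0.0 or upper < 0.0


# With s_e = 5 and c = 3 the half-width is 3 * 5 * sqrt(2), about 21.2,
# so a difference of 22 is significant and a difference of 21 is not.
print(is_significant(72.0, 50.0, 5.0))  # True
print(is_significant(71.0, 50.0, 5.0))  # False
```
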


It should be noted that we are discussing the comparison of single 
cases. For us to be certain that Mr. A's score is different from Mr. B's 
score there must be a very large difference between the two scores. 
However, when we are setting up a selection policy that is to be used 
on several hundred cases, it is legitimate, for example, to accept every- 
one with a score of 76, or higher, and reject everyone with a score of 
75 or lower. The average true score of a hundred persons scoring 76 
will be higher than the average true score of a hundred persons scoring 
75, so that in the long run better persons will be accepted and poorer 
ones rejected. However, the magnitude of the error that may be made 
in a single case or the percentage of errors that will be made in a large 
number of cases is indicated by the error of measurement. 


12. Correlation of true and observed scores (index of reliability) 

In order to obtain the correlation between true and observed scores 
we begin with the basic equation for correlation, writing the correlation 
of observed and true scores as 

(49) $r_{xt} = \frac{\sum xt}{N s_x s_t}$. 

Substituting equation 14 in equation 49 gives 

(50) $r_{xt} = \frac{\sum (t + e)t}{N s_x s_t}$. 

Removing parentheses, we have 

(51) $r_{xt} = \frac{\sum t^2 + \sum te}{N s_x s_t}$. 

Dividing each of the terms in the numerator by the N in the denominator 
gives 

(52) $r_{xt} = \frac{s_t^2 + r_{te} s_t s_e}{s_x s_t}$. 

Since the correlation between deviation scores is identical with the 
correlation between gross scores, we can see from equation 3 that the 
second term of the numerator in 52 vanishes. Dividing both numerator 
and denominator by $s_t$ gives 

(53) $r_{xt} = \frac{s_t}{s_x}$. 

Substituting equation 39 in equation 53 and canceling $s_x$ from numerator 
and denominator gives 

(54) $r_{xt} = \sqrt{r_{x_g x_h}}$ (index of reliability). 

The foregoing formula was given by Kelley (1916). 


The correlation between observed scores and true scores as 
given by equation 54 is known as the index of reliability. 
The test validity must always be less than the index of reli- 
ability. 


13. Correlation of observed scores and error scores 
Just as in considering the correlation between observed and true 
scores we begin here with the basic deviation score formula for correlation, 

(55) $r_{xe} = \frac{\sum xe}{N s_x s_e}$. 

As before, substituting equation 14 in equation 55 gives 

(56) $r_{xe} = \frac{\sum (t + e)e}{N s_x s_e}$. 

Following the same procedure as in the preceding section, we remove 
parentheses, obtaining 

(57) $r_{xe} = \frac{\sum te + \sum e^2}{N s_x s_e}$. 

Dividing each of the terms in the numerator by the N in the denominator 
gives 

(58) $r_{xe} = \frac{r_{te} s_t s_e + s_e^2}{s_x s_e}$. 

Again from equation 3 we see that the first term in the numerator 
vanishes, and we may then divide numerator and denominator by $s_e$, 
obtaining 

(59) $r_{xe} = \frac{s_e}{s_x}$. 

Substituting equation 42 in equation 59 and dividing numerator and 
denominator by $s_x$ gives 

(60) $r_{xe} = \sqrt{1 - r_{x_g x_h}}$. 

[Figure 6. Relationship between test reliability, index of reliability, and correlation 
of observed and error scores.] 

Equation 60 gives the correlation between observed scores and 
error scores as a function of the reliability coefficient. From 
equation 42, we see that $r_{xe}$ is equal to the standard error of 
measurement when the test standard deviation is taken as 
unity. In Chapter 19 it is shown that the standard deviation 
of standard scores is equal to unity. Thus equation 60 
gives the error of measurement for a standard score. 


The relationship between the reliability coefficient, the index of reliability, 
and the quantity $\sqrt{1 - r_{x_g x_h}}$ is shown in Figure 6, which is 
similar to Figure 3 showing the relationship between reliability, standard 
deviation, and error of measurement. 

The x- and y-axes are scaled by tenths, or finer, from zero to one. 
One of these axes gives the index of reliability, and the other the cor- 
relation between observed and error scores. A quadrant is drawn with 
radius unity, and is scaled in terms of the reliability coefficient. Any 
point on this quadrant then represents a given reliability, index of re- 
liability, and correlation between observed and error scores. With 
any one of the values given, the other two can be read from the graph. 
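
The reading taken from Figure 6 can be sketched numerically from equations 54 and 60. The function names are illustrative assumptions. Because the two quantities are $\sqrt{r}$ and $\sqrt{1 - r}$, their squares sum to unity, which is why every point lies on a quadrant of radius one.

```python
import math


def index_of_reliability(reliability: float) -> float:
    """Correlation of observed with true scores (equation 54): sqrt(r)."""
    return math.sqrt(reliability)


def observed_error_correlation(reliability: float) -> float:
    """Correlation of observed with error scores (equation 60): sqrt(1 - r)."""
    return math.sqrt(1.0 - reliability)


r = 0.64
r_xt = index_of_reliability(r)
r_xe = observed_error_correlation(r)
print(round(r_xt, 2), round(r_xe, 2))   # 0.8 0.6
print(round(r_xt ** 2 + r_xe ** 2, 6))  # 1.0
```
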


14. Summary 

The material in this chapter has been based upon the following definitions: 

1. Definitions, from elementary statistics, of the mean, standard 
deviation, and correlation, 

$M_X = \frac{\sum X}{N}$, 

$s_x = \sqrt{\frac{\sum x^2}{N}}$, 

$r_{xy} = \frac{\sum xy}{N s_x s_y}$. 

2. Definition of the relationship between observed, true, and error 
score, 

(1) $X_i = T_i + E_i$, 

or its equivalent 

(14) $x = t + e$. 

3. Definition of random errors, 

(2) $M_E = 0$, 

(4) $r_{E_g E_h} = 0$, 

(3) $r_{TE} = 0$. 

4. Definition of parallel tests, 

(19) $T_{ig} = T_{ih}$, 

(20) $s_{e_g} = s_{e_h}$. 


From these definitions the following equations have been derived: 


1. Since the true score for each individual is the same on parallel 
tests, it follows that 


(21) $M_{T_g} = M_{T_h}$, 
(22) $s_{t_g} = s_{t_h}$, 
(23) $r_{T_g T_h} = 1.00$. 


2. From the two equations defining parallel tests, it is shown that 
the observed scores on parallel tests must satisfy the following 
characteristics: 

(27) $M_{X_g} = M_{X_h}$, 
(35) $s_{x_1} = s_{x_2}$, 
(37) $r_{x_1 x_2} = r_{x_3 x_4}$. 


3. The variances of true, observed, and error scores are related 
by the equation 

(17) $s_x^2 = s_t^2 + s_e^2$ 

or 

(18) $s_x = \sqrt{s_t^2 + s_e^2}$. 


4. It has been shown that the mean true score and the standard 
deviation of both true and error scores can be expressed in terms 
of observable quantities as follows: 

(9) $M_T = M_X$, 
(39) $s_t = s_x \sqrt{r_{x_g x_h}}$, 
(42) $s_e = s_x \sqrt{1 - r_{x_g x_h}}$ (error of measurement). 

5. It has been shown that the correlation of observed scores with 
true and error scores can be expressed in terms of observable 
quantities as follows: 

(54) $r_{xt} = \sqrt{r_{x_g x_h}}$ (index of reliability), 
(60) $r_{xe} = \sqrt{1 - r_{x_g x_h}}$. 


Among the foregoing quantities, the error of measurement is the most 
significant and the most generally useful. Illustrations have been given 
of how the error of measurement can be used to assign limits between 
which the person’s true score is very likely to be found. 


Problems 
1. Give the error of measurement, standard deviation of true scores, correlation 
between observed and error scores, and the index of reliability for each of the 
following tests: 

Test   Number of Items   Mean    Standard Deviation   Reliability 
A       50               100.0   15.0                 .91 
B      100               211.6   25.7                 .84 
C       80                57.4   11.3                 .78 
D      700               361.9   76.5                 .87 
E      200               127.4   21.9                 .76 

2. Assume a normal distribution of error scores. Give the true score limits 
(approximately 0.3 per cent level) for persons making each of the following scores: 

(a) A score of 115 on test A. 

(b) A score of 211 on test B. 

(c) A score of 31 on test C. 

(d) A score of 500 on test D. 

(e) A score of 100 on test E. 

3. For each test A through E what is the minimum difference between the observed 
scores of two individuals that will give reasonable assurance that they do not have 
the same true scores? 


Fundamental Equations Derived from a Definition of True Score 


1. Definition of true score 

Again let us begin with basic equation 1 of Chapter 2 ($X = T + E$) 
and put it in the form 

(1) $E = X - T$. 

In other words, the error is defined as the difference between true score 
and observed score. Then, if "true score" is defined, the error can be 
determined. True score is defined as 

(2) $T_i = \lim_{K \to \infty} \frac{\sum_{g=1}^{K} X_{ig}}{K}$; 

that is, the true score for a given person (i) is the limit that the average 
of his scores on a number of tests approaches, as the number of parallel 
tests (K) increases without limit. These tests are designated by the 
subscript g, which varies from 1 to K. 


2. Definition of parallel tests 


In addition to the definition of true score, a definition of parallel 
tests is needed. Instead of defining parallel tests in terms of true and 
error score (as we did in the chapter immediately preceding), and then 
deriving the observed score characteristics of parallel tests, we shall 
define parallel tests this time in terms of observable characteristics. 

Again we may begin with the basic definition that tests are parallel 
if "it makes no difference which one is used." Clearly such a definition 
requires that the means and standard deviations of a given group of 
subjects be equal regardless of which test is used. Expressing this in 
equational form, we have 

(3) $\bar{X}_1 = \bar{X}_2 = \cdots = \bar{X}_g = \cdots = \bar{X}_K$ 

and 

(4) $s_1 = s_2 = \cdots = s_g = \cdots = s_K$. 

Also it is clear that 

(5) $r_{12} = r_{13} = \cdots = r_{gh} = \cdots = r_{(K-1)K}$, 

since if one of these correlations were higher than the others, the "prediction" 
from one of these tests to the other would be higher than could 
be obtained by using other combinations of two tests. "It makes no 
difference" which test is used only if all the intercorrelations between 
parallel forms are equal. In order to be useful, this correlation must be 
high for parallel tests. That is, the standard deviation of the true 
scores must be considerably greater than the standard deviation of the 
error scores. 

The basic definitions used in this chapter are given in equations 1 to 5, 
inclusive. From these equations we can, without any further assumptions, 
derive all the equations given in the preceding chapter. 


3. Determination of mean true score 
The mean true score is obtained by summing equation 2 over the 
number of cases (N) and dividing by N. 

(6) $M_T = \frac{1}{N} \sum_{i=1}^{N} \lim_{K \to \infty} \frac{1}{K} \sum_{g=1}^{K} X_{ig}$. 

Since the order of summation makes no difference, we may substitute 
$M_g$ for $\sum_{i=1}^{N} X_{ig}(1/N)$ and write 

(7) $M_T = \lim_{K \to \infty} \frac{1}{K} \sum_{g=1}^{K} M_g$. 

By the assumption of equation 3 all the means are equal for parallel 
tests. Therefore, in the summation, $M_g$ or $\bar{X}_g$ may be treated as a 
constant, so that $\sum M_g = K M_g$. Substituting this value in equation 7 
gives 

(8) $M_T = M_g = \bar{X}_g$. 
The mean observed score is equal to the mean true score. 
Equation 8 is based only on the definition of equation 2 and 
the assumption of equation 3. 

4. Determination of true score variance 

From the definition of a standard deviation, we may write 

(9) $s_t^2 = \frac{\sum_{i=1}^{N} (T_i - M_T)^2}{N}$. 

In order to simplify the term in parentheses, we can make use of equations 
2 and 8, obtaining 

(10) $T_i - M_T = \frac{\sum_{g=1}^{K} X_{ig}}{K} - M_g$. 

We omit here the proviso, stated in equation 2, that the true score is 
the limit of the average of a large number of tests as "K approaches 
infinity." This proviso is omitted for convenience in the following 
equations. It will be introduced explicitly again in equation 18. It 
should be noted that nothing is done in the meantime, in equations 11 
through 17, to invalidate carrying the assumption through. 

In order to simplify the notation, let us use t to designate a deviation 
score and put all the right-hand term over one denominator, obtaining 

(11) $t_i = \frac{\sum_{g=1}^{K} X_{ig} - K M_g}{K}$. 

The numerator of equation 11 may be written as shown below and the 
equation expressed as follows: 

(12) $t_i = \frac{\sum_{g=1}^{K} (X_{ig} - M_g)}{K}$. 

We may substitute x for X − M and write out the numerator to avoid 
the summation sign, obtaining 

(13) $t_i = \frac{x_{i1} + x_{i2} + \cdots + x_{iK}}{K}$. 

yn value of t in equation 13 may be substituted for T — M in equa- 
ion 9: 


N 


P» (ra + xis + xi +++ + xu) 
(14) eas 


NK? 
Expanding the numerator and omitting the limits on the summation 
sign, since all summations are over N, we can write 

(15)  $s_t^2 = \frac{\sum x_1^2 + \sum x_2^2 + \sum x_3^2 + \cdots + \sum x_K^2 + \sum x_1x_2 + \sum x_1x_3 + \cdots + \sum x_Kx_{K-1}}{NK^2}.$

We may substitute $Ns_1^2$ for $\sum x_1^2$, and $Nr_{12}s_1s_2$ for $\sum x_1x_2$, and 
correspondingly for other products. This substitution gives 

(16)  $s_t^2 = \frac{Ns_1^2 + Ns_2^2 + Ns_3^2 + \cdots + Ns_K^2 + Nr_{12}s_1s_2 + Nr_{13}s_1s_3 + \cdots + Nr_{K(K-1)}s_Ks_{K-1}}{NK^2}.$


We may divide the numerator and the denominator by N. Since from 
equations 4 and 5 we see that, according to the definition of parallel 
tests, all standard deviations are equal and all intercorrelations are 
equal, we may substitute $s_g^2$ for each of the variances and $r_{gh}$ for each 
of the intercorrelations.¹ These substitutions result in 

(17)  $s_t^2 = \frac{Ks_g^2 + K(K - 1)r_{gh}s_g^2}{K^2}.$


Dividing the numerator and the denominator through by K and 
separating terms gives 

(18)  $s_t^2 = \frac{s_g^2}{K} + \left(1 - \frac{1}{K}\right) r_{gh}s_g^2.$

As was noted in equation 10, we neglected to specify that K should 
approach infinity, as was done in equation 2 defining the true score. 
If we now introduce this part of the definition of true score, 

(19)  $s_t^2 = r_{gh}s_g^2$  (true variance). 

If we take the square root of both sides of equation 19, we have 

(20)  $s_t = s_g\sqrt{r_{gh}}$  (true standard deviation). 

Equations 19 and 20 give the variance and standard deviation 
of the distribution of true scores in terms of the test reliability 
and the standard deviation of the distribution of observed 
scores. These equations may be derived by assuming 
only equations 2, 4, and 5. 
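
As a numerical illustration of equations 18 through 20, the following Python sketch evaluates the variance of the average of K parallel tests and shows it approaching the true variance as K grows. The function names and the sample values (a standard deviation of 10 and a reliability of 0.81) are hypothetical and are chosen only for illustration.

```python
import math


def variance_of_average(s_g, r_gh, k):
    """Equation 18: variance of the mean of k parallel tests."""
    return s_g**2 / k + (1 - 1 / k) * r_gh * s_g**2


def true_variance(s_g, r_gh):
    """Equation 19: the limit of equation 18 as k increases without limit."""
    return r_gh * s_g**2


def true_sd(s_g, r_gh):
    """Equation 20: the square root of equation 19."""
    return s_g * math.sqrt(r_gh)


if __name__ == "__main__":
    s_g, r_gh = 10.0, 0.81  # hypothetical illustrative values
    for k in (1, 2, 10, 100, 10_000):
        print(k, round(variance_of_average(s_g, r_gh, k), 3))
    print("limit:", true_variance(s_g, r_gh), "sd:", true_sd(s_g, r_gh))
```

With K = 1 the expression reduces to the observed variance of a single test; as K increases the first term of equation 18 vanishes and only the term in the reliability remains.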

¹ Note that $r_{gh}$ is the correlation between any two parallel tests (g and h), and 
therefore is a reliability coefficient. It is identical with the reliability 
coefficient used in Chapter 2. 
It should be noted that equations 19 and 20 are the same as equations 
38 and 39 given in Chapter 2 for true variance and true standard 
deviation. However, in the preceding chapter these theorems were derived 
from assumptions about "random errors" and a definition of "parallel 
tests" in terms of true and error scores. In this chapter the same 
theorems are derived from a definition of "true score" (equation 2) 
and from a definition of parallel tests that depends upon observed scores. 


5. Correlation of true with observed scores (the index 
of reliability) 


Using the usual formula for correlation and the definition of true 
score as the average of scores on a large number of parallel tests (see 
equation 2), we may write 


(21)  $r_{xt} = \frac{\sum x_1(x_1 + x_2 + \cdots + x_K)}{NKs_1s_t} \quad (K \to \infty).$

Removing the parentheses gives 

(22)  $r_{xt} = \frac{\sum x_1^2 + \sum x_1x_2 + \sum x_1x_3 + \cdots + \sum x_1x_K}{NKs_1s_t} \quad (K \to \infty).$

Dividing both numerator and denominator by N gives 

(23)  $r_{xt} = \frac{s_1^2 + r_{12}s_1s_2 + r_{13}s_1s_3 + \cdots + r_{1K}s_1s_K}{Ks_1s_t} \quad (K \to \infty).$

Canceling $s_1$ from numerator and denominator, we have 

(24)  $r_{xt} = \frac{s_1 + r_{12}s_2 + r_{13}s_3 + \cdots + r_{1K}s_K}{Ks_t} \quad (K \to \infty).$

It will be noted that there are K − 1 terms involving r. Since the 
sum of a series of terms is equal to the number of terms multiplied by 
the average term, we may write 

(25)  $r_{xt} = \frac{s_1 + (K - 1)\overline{rs}}{Ks_t} \quad (K \to \infty),$

where $\overline{rs}$ equals the sum of the terms $r_{1h}s_h$ (h = 2, ..., K) divided by K − 1. 
Substituting for $s_t$ its value from equation 20, we have 

(26)  $r_{xt} = \frac{s_1 + (K - 1)\overline{rs}}{Ks_g\sqrt{r_{gh}}} \quad (K \to \infty).$

According to equations 3, 4, and 5, one of the rs products may be 
substituted for the average of an infinite number. Making this 
substitution, dividing numerator and denominator by $s_g$, and separating the 
terms gives 

(27)  $r_{xt} = \frac{1}{K\sqrt{r_{gh}}} + \left(1 - \frac{1}{K}\right)\sqrt{r_{gh}} \quad (K \to \infty).$

If we let K approach infinity, equation 27 becomes 

(28)  $r_{xt} = \sqrt{r_{gh}}$  (index of reliability), 

where $r_{gh}$ is the reliability of the test. 


The test validity must always be less than the correlation be- 
tween observed scores and true scores, or the index of reli- 
ability as given by equation 28. 


Again it may be noted that this equation is identical with the equation 
given in the preceding chapter for the correlation between true and 
observed scores. However, it is derived from one set of assumptions 
in Chapter 2 and from another set in Chapter 3. 
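
The approach of this correlation to the index of reliability can be sketched numerically. The Python fragment below is an illustration only, not part of the original derivation: it assumes equal standard deviations and equal intercorrelations (equations 4 and 5) and, unlike equation 26, it uses the finite-K standard deviation of the average from equation 18 rather than its limiting value, so that the result is a proper correlation for every K. The function name and the reliability of 0.64 are hypothetical.

```python
import math


def corr_with_average(r_gh, k):
    """Correlation of one test with the average of k parallel tests.

    Numerator: equation 24 with equal s and equal r, divided by s_g.
    Denominator: square root of equation 18, divided by s_g.
    """
    numerator = (1 + (k - 1) * r_gh) / k
    sd_average = math.sqrt(1 / k + (1 - 1 / k) * r_gh)
    return numerator / sd_average


if __name__ == "__main__":
    r_gh = 0.64  # hypothetical reliability
    for k in (1, 2, 5, 50, 5000):
        print(k, round(corr_with_average(r_gh, k), 4))
    print("index of reliability:", math.sqrt(r_gh))
```

For K = 1 the average is the test itself and the correlation is unity; as K increases the value falls toward the square root of the reliability, in agreement with equation 28.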


6. Average error 

Using equation 1 for errors, we may sum it over persons from 1 to N. 
Since all summations are over persons from 1 to N, no ambiguity will 
arise if subscripts are omitted and the limits of summation are not 
indicated. We write 

(29)  $\sum E = \sum X - \sum T.$

Dividing equation 29 by N, we have 

(30)  $M_E = M_X - M_T.$

Substituting equation 8, we see that 

(31)  $M_E = 0.$


The average error is zero. 


7. The standard deviation of error scores (the error of 
measurement) 

Using the usual formula for standard deviation, we may write from 
equation 1, noting that the differences of deviation scores equal the 
differences of gross score, 

(32)  $s_e^2 = \frac{\sum (x - t)^2}{N}.$

Expanding the numerator, we have 

(33)  $s_e^2 = \frac{\sum x^2 + \sum t^2 - 2\sum xt}{N}.$

Dividing through by N gives 

(34)  $s_e^2 = s_x^2 + s_t^2 - 2r_{xt}s_xs_t.$

Substituting equations 20 and 28 in equation 34 gives 

(35)  $s_e^2 = s_x^2 + s_x^2r_{gh} - 2\sqrt{r_{gh}}\,s_xs_x\sqrt{r_{gh}}.$

Combining terms in equation 35 and simplifying, we get¹ 

(36)  $s_e^2 = s_x^2 - s_x^2r_{gh},$

or 

(37)  $s_e^2 = s_x^2(1 - r_{gh})$  (variance of the errors of measurement). 

Taking the square root of both sides of equation 37, we have the usual 
formula for the error of measurement, 

(38)  $s_e = s_x\sqrt{1 - r_{gh}}$  (error of measurement), 

where $s_x$ is the standard deviation of the distribution of gross scores, and 
$r_{gh}$ is the test reliability. 


Equations 37 and 38 are the same as equations 41 and 42 in 
Chapter 2. In this chapter the error of measurement is de- 
rived from the assumptions of equations 1, 2, 4, and 5. 


8. Relationship between true and error variance 
By adding equations 37 and 19 we obtain 

(39)  $s_t^2 + s_e^2 = r_{gh}s_x^2 + s_x^2(1 - r_{gh}).$

If we factor out $s_x^2$, we have 

(40)  $s_t^2 + s_e^2 = s_x^2(r_{gh} + 1 - r_{gh}),$

which equals 

(41)  $s_x^2 = s_t^2 + s_e^2.$


The true variance plus the error variance is equal to the ob- 
served variance. 
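
The additivity of true and error variance can be checked with a few lines of arithmetic. The following Python sketch simply evaluates equations 19, 37, and 38 for one hypothetical test (standard deviation 15, reliability 0.91); the function names are illustrative and are not taken from the text.

```python
import math


def true_variance(s_x, r_gh):
    """Equation 19."""
    return r_gh * s_x**2


def error_variance(s_x, r_gh):
    """Equation 37."""
    return s_x**2 * (1 - r_gh)


def error_of_measurement(s_x, r_gh):
    """Equation 38."""
    return s_x * math.sqrt(1 - r_gh)


if __name__ == "__main__":
    s_x, r_gh = 15.0, 0.91  # hypothetical values
    total = true_variance(s_x, r_gh) + error_variance(s_x, r_gh)
    print("true variance:", true_variance(s_x, r_gh))
    print("error variance:", error_variance(s_x, r_gh))
    print("sum:", total, "observed variance:", s_x**2)
    print("error of measurement:", error_of_measurement(s_x, r_gh))
```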


¹ Note that for parallel tests $s_x$, $s_g$, and $s_h$ may be used interchangeably. 

Again it may be pointed out that the last three theorems involving 
error variance were proved from a different set of assumptions in 
Chapter 2. 


9. Correlation of errors and observed scores 
Using the usual formula for correlation, we can write, from 
equation 1, 

(42)  $r_{ex} = \frac{\sum x(x - t)}{Ns_xs_e}.$

Expanding and factoring out N as before gives 

(43)  $r_{ex} = \frac{s_g^2 - r_{gt}s_gs_t}{s_gs_e}.$

Factoring out $s_g$ and substituting from equations 20, 28, and 38 gives 

(44)  $r_{ex} = \frac{s_g - \sqrt{r_{gh}}\,s_g\sqrt{r_{gh}}}{s_g\sqrt{1 - r_{gh}}}.$

Dividing through by $s_g$ again, we have 

(45)  $r_{ex} = \frac{1 - r_{gh}}{\sqrt{1 - r_{gh}}},$

which is equivalent to 

(46)  $r_{ex} = \sqrt{1 - r_{gh}},$

where $r_{gh}$ is the test reliability. 
Equation 46 is identical with equation 60 of Chapter 2. It 
is equal to the standard error of measurement when the test 
standard deviation is unity. Since the standard deviation of 
standard scores is equal to unity (see Chapter 19), equation 
46 is sometimes referred to as the error of measurement for 
a standard score. 


10. The correlation between true score and error score 
This correlation is derived by exactly the same procedure as in the 
last section. We first write 

(47)  $r_{et} = \frac{\sum t(x - t)}{Ns_ts_e}.$

Expanding equation 47 and dividing by N, we have 

(48)  $r_{et} = \frac{r_{tx}s_ts_x - s_t^2}{s_ts_e}.$

Substituting the values of $r_{tx}$, $s_t$, and $s_t^2$ from equations 19, 20, and 28, 
we have 

(49)  $r_{et} = \frac{s_g^2\sqrt{r_{gh}}\sqrt{r_{gh}} - s_g^2r_{gh}}{s_ts_e} \quad (s_g = s_x).$

It can be seen that the numerator is equal to zero, and therefore 

(50)  $r_{et} = 0.$

The correlation between error scores and true scores is zero. 


This theorem, which is proved from the definitions of true score and 
parallel tests, is identical with one of the assumptions made in defining 
random error in Chapter 2. 


11. Summary 

In this chapter we began with a definition of true score and a definition 
of parallel tests in terms of observed scores. From these definitions the 
fundamental theorems of test theory were derived. By comparison we 
see that they are the same as those in Chapter 2, except that a different 
set of equations was chosen for the assumptions. 


First, error was defined as the difference between the observed and 
the true score: 

(1)  $E = X - T.$

True score of a given person was defined as the average of his scores 
on a number of parallel tests, as this number increases without limit: 

(2)  $T_i = \lim_{K \to \infty} \frac{\sum_{g=1}^{K} X_{ig}}{K}.$

Parallel tests were defined as tests with equal means, standard 
deviations, and intercorrelations: 

(3)  $M_1 = M_2 = \cdots = M_K,$
(4)  $s_1 = s_2 = \cdots = s_K,$
(5)  $r_{12} = r_{13} = \cdots = r_{(K-1)K}.$


Another way to state the definitions of equations 3, 4, and 5 is to 
say that for a group of parallel tests we assume that a reasonable approxi- 
mation is obtained if we substitute the mean of a single test for the mean 
of the mean of all tests. In like manner, the variance of a single test 
furnishes a reasonable approximation to the average of the variances of 
an infinite number of tests; and the covariance of a single pair of tests 
furnishes a reasonable approximation to the average of the covariances 
of an infinite number of parallel tests. This set of assumptions is 
necessary in order to enable us to substitute actual values in the formulas 
derived. 

The true score mean, variance, and standard deviation were found 
to be 

(8)  $M_T = M_X,$
(19)  $s_t^2 = r_{gh}s_g^2,$
(20)  $s_t = s_g\sqrt{r_{gh}}.$

It was proved that the error score mean, variance, and standard 
deviation were 

(31)  $M_E = 0,$
(37)  $s_e^2 = s_x^2(1 - r_{gh}),$
(38)  $s_e = s_x\sqrt{1 - r_{gh}}$  (error of measurement). 

The intercorrelations of observed, true, and error scores were shown to be 

(50)  $r_{te} = 0,$
(46)  $r_{xe} = \sqrt{1 - r_{gh}},$
(28)  $r_{xt} = \sqrt{r_{gh}}$  (index of reliability). 

It was also shown that 

(41)  $s_x^2 = s_t^2 + s_e^2.$

From the definition of true score we see that the score is the same 
regardless of the particular few tests a person takes, and consequently 
the true variance for a number of parallel tests is the same. Since the 
observed and the true variances are the same for each of a set of parallel 
tests, it follows that the error variance of each of the tests should be the 
same. That is, we may write 

$T_{ig} = T_{ih},$
$s_{t_g} = s_{t_h},$
$s_{e_g} = s_{e_h}.$

These characteristics of parallel tests were assumed in Chapter 2, and 
those given in equations 3, 4, and 5 were derived. 

The set of equations derived in this chapter is identical with that 
derived in Chapter 2. It has been shown that the fundamental equations 
of test theory can be derived either from a definition of error and a 
definition of parallel tests in terms of true and error score or from 
a definition of true score and a definition of parallel tests in terms of 
observed scores. 


Problems 


1. Write the equation corresponding to each of the following assumptions: 

A. The true score is the difference between observed score and error score. 
B. The average error score for a large group of persons is zero. 
C. True scores and error scores are uncorrelated. 
D. Error scores on one form of a test are uncorrelated with those on another form. 
E. Parallel tests have identical means. 
F. Parallel tests have identical standard deviations. 


2. Using only the necessary ones of the foregoing assumptions (and no additional 
assumptions), derive each of the following: 


(a) What is the value of the average true score? 

(b) The observed variance is the sum of true and error variance. 
(c) Find the value of the true variance. 

(d) Find the value of the error variance. 

(e) Find the correlation between observed and error scores. 

(f) Find the correlation between true and observed scores. 


Note: Work each of the foregoing six derivations independently. At the beginning 
of each of the six derivations, give the assumptions from the list (A-F) that are 
essential for that particular derivation. 


3. By using only equations 1, 20, and 28, prove that the correlation between errors 
on test g and errors on test h is zero if g and h are parallel tests. 


4 


Errors of Measurement, 
Substitution, and Prediction 


1. Introduction 

The commonly used and most generally useful measure of “test error" 
is the "error of measurement" defined in Chapters 2 and 3. It is the 
standard deviation of the distribution of differences between observed 
score and true score. However, there are other possible measures of 
test error that are useful for certain purposes. These measures will be 
considered in this chapter. The four different types of error are defined 
as follows: 

$e = x - t,$
$e = t - r_{xt}\left(\frac{s_t}{s_x}\right)x,$
$d = x_1 - x_2,$
$d = x_1 - r_{12}\left(\frac{s_1}{s_2}\right)x_2.$

Stated in words, these four types of error are: 

The difference between true and observed score. 
The error made in estimating true score from observed score. 
The difference between two observed scores on parallel tests. 
The error made in predicting one observed score from the score on a 
parallel test. 

These different measures of error are presented by Kelley (1927). 
The third type of error listed above is the simplest and most direct 
measure of error, so let us consider it first. 
2. Error of substitution¹ 

Here we define the error as the difference between two observed scores 
on parallel tests, that is, 

(1)  $d = x_1 - x_2.$


This definition of error applies if we are interested in considering the 
possible differences between the results of one investigator using a given 
test and another investigator using a parallel form of the same test. 

In order to obtain the standard deviation of these difference scores, 
which is the standard error of substitution, we get the standard deviation 
in the usual way, by squaring, summing, and dividing by N. This gives 


(2)  $\frac{\sum d^2}{N} = \frac{\sum (x_1 - x_2)^2}{N}.$

We may substitute $s_d^2$ for the left-hand member and expand the 
right-hand member, obtaining 

(3)  $s_d^2 = \frac{\sum x_1^2}{N} + \frac{\sum x_2^2}{N} - \frac{2\sum x_1x_2}{N}.$

Since the first two terms are variances and the last one a covariance, 
we may write 

(4)  $s_d^2 = s_1^2 + s_2^2 - 2r_{12}s_1s_2.$

Since standard deviations of parallel tests are equal, we may write 

(5)  $s_d^2 = 2s_1^2(1 - r_{12}),$

where $s_d^2$ is the squared error of substitution, or the squared error made in 
substituting a score on one test for a score on a parallel form, 
$s_1^2$ is the observed variance of the test, 
$r_{12}$ is the test reliability, or the correlation of two parallel forms. 

Taking the square root of both sides of equation 5 gives 

(6)  $s_d = s_1\sqrt{2(1 - r_{12})}.$


The error made in substituting a score on one test for a score 
on a parallel form is given by equation 6. This is also the 
standard error of a difference for the case in which the two 
standard deviations are alike. 


¹ This term was introduced by M. W. Richardson while teaching at the University 
of Chicago. 

3. Error in estimating observed score 

Instead of saying that we substitute the score on one test for the 
score on a parallel form, we can use the ordinary "error of estimate" 
and compute the minimal error that can be made in predicting the score 
on one form from the other form by using the least squares regression 
equation. As indicated in the introduction to this chapter, we write 


(7)  $d = x_1 - r_{12}\left(\frac{s_1}{s_2}\right)x_2.$

As before, we write the variance of d, noting that since $s_1 = s_2$, the 
term in parentheses is unity and may be omitted. 

(8)  $\frac{\sum d^2}{N} = \frac{\sum (x_1 - r_{12}x_2)^2}{N}.$

Expanding as before and substituting $s_d^2$ for the left-hand term, we have 

(9)  $s_d^2 = \frac{\sum x_1^2}{N} + r_{12}^2\frac{\sum x_2^2}{N} - 2r_{12}\frac{\sum x_1x_2}{N}.$

Equation 9 can be rewritten as 

(10)  $s_d^2 = s_1^2 + r_{12}^2s_2^2 - 2r_{12}^2s_1s_2.$

Since the variances of parallel tests are equal we may write 

(11)  $s_d^2 = s_1^2(1 - r_{12}^2).$

Taking the square root, we have the final equation 

(12)  $s_d = s_1\sqrt{1 - r_{12}^2},$

which is the usual standard error of estimate. 

Equation 12 gives the error made when the regression equation 
is used to estimate the scores on one test from scores on 
a parallel test. 

It should be noted that equation 12 is the correct one to use if the 
regression equation has been used to estimate scores on a parallel form 
and we wish to determine the error involved. Equation 6 is the correct 
one to use if scores on one test are assumed to be equal to those on a 
parallel test without use of any regression equation. 
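
The distinction between the two formulas can be verified directly on a small set of scores. In the Python sketch below the two score lists are invented for illustration (they are permutations of each other, so that the means and standard deviations are exactly equal, as required of parallel tests), and the helper `pearson_r` is a hypothetical name. The standard deviation of the raw differences is compared with equation 6, and the standard deviation of the regression residuals with equation 12; standard deviations are computed with N in the denominator, as in the text.

```python
import math
from statistics import fmean, pstdev


def pearson_r(a, b):
    """Product-moment correlation, using N in the denominator throughout."""
    ma, mb = fmean(a), fmean(b)
    cov = fmean([(u - ma) * (v - mb) for u, v in zip(a, b)])
    return cov / (pstdev(a) * pstdev(b))


# Hypothetical scores of five persons on two parallel forms.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]

s = pstdev(x1)              # equals pstdev(x2) for these data
r12 = pearson_r(x1, x2)
m1, m2 = fmean(x1), fmean(x2)

# Error of substitution: the difference scores themselves (equation 1).
sub_direct = pstdev([u - v for u, v in zip(x1, x2)])
sub_formula = s * math.sqrt(2 * (1 - r12))          # equation 6

# Error of estimate: residuals from the regression of form 1 on form 2 (equation 7).
est_direct = pstdev([(u - m1) - r12 * (v - m2) for u, v in zip(x1, x2)])
est_formula = s * math.sqrt(1 - r12**2)             # equation 12

print(sub_direct, sub_formula)
print(est_direct, est_formula)
```

For any reliability below unity the second pair of values is smaller than the first, which is the sense in which the regression equation gives the minimal error.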


4. The error of measurement 


The error of measurement can be interpreted in several ways, which 
will be considered in the next chapter. We shall consider here only one 
interpretation, namely, the error of measurement is the error made in 
substituting the observed score for the true score. We wish to assign 
each person a true score, and instead we assign the observed score. The 
difference between these two scores is the error of measurement. 
Derivations of this error of measurement have been given in preceding 
chapters; however, one will be repeated here. 

One of the basic assumptions of test theory is that 

(13)  $x = t + e.$

Squaring and summing both sides gives 

(14)  $\sum x^2 = \sum (t + e)^2.$

Expanding the term in parentheses, we have 

(15)  $\sum x^2 = \sum t^2 + \sum e^2 + 2\sum te.$

Dividing through by N we can write this equation in terms of variances 
and covariances as follows: 

(16)  $s_x^2 = s_t^2 + s_e^2 + 2r_{te}s_ts_e.$

Since one of the fundamental assumptions in the definition of error is 
that it correlates zero with true score, we may omit the last term and 
write 

(17)  $s_x^2 = s_t^2 + s_e^2.$

From the previous discussion of true variance we see that $r_{xx}s_x^2$¹ is equal 
to the true variance. Therefore, substituting this for $s_t^2$ and solving 
for $s_e^2$, we have 

(18)  $s_e^2 = s_x^2 - r_{xx}s_x^2,$

which may be written 

(19)  $s_e^2 = s_x^2(1 - r_{xx}).$

Taking the square root of both sides, we have 

(20)  $s_e = s_x\sqrt{1 - r_{xx}},$

which is the formula previously given for the error of measurement. 


¹ In Chapter 2, the reliability coefficient was written with subscripts referring 
to forms g and h, to emphasize the fact that it was the correlation between forms 
g and h of a test. Similarly, in Chapter 3 the subscripts g and h were retained so 
that a sum of reliability coefficients could be indicated. When we are not 
emphasizing the correlation among various parallel forms, it is convenient to 
designate the reliability coefficient by repeating a subscript. Thus $r_{xx}$ is the 
reliability of test x; $r_{yy}$, the reliability of test y; $r_{11}$ and $r_{22}$, 
the reliability of tests 1 and 2, respectively; etc. 

The error of measurement is the error made in substituting 
the observed score for the true score. 


5. Error in estimating true score 

In this chapter we have considered the error made in substituting the 
score on one test for the score on a parallel test. Also we have shown 
that the error made is smaller if we use the score on the first test to 
predict the score on the second test and then obtain the error of 
estimate. The error of measurement can be interpreted in several ways; 
one way is to regard it as the error made in substituting the observed 
score for the true score. Also it is possible to ask what error is made in 
attempting to predict the true score from the observed score. In order 
to obtain this error, we set up the usual prediction equation: 

(21)  $\hat{t} = r_{xt}\left(\frac{s_t}{s_x}\right)x,$

where $\hat{t}$ is the predicted value of the true score. The difference between 
the actual and predicted true score is the error. This may be written 

(22)  $e = t - r_{xt}\left(\frac{s_t}{s_x}\right)x.$

The standard deviation of e or the usual error of estimate then is 

(23)  $s_e = s_t\sqrt{1 - r_{xt}^2}.$

Since $s_t = s_x\sqrt{r_{xx}}$ and $r_{xt} = \sqrt{r_{xx}}$ as was shown in Chapters 2 and 3, 
we may write 

(24)  $s_e = s_x\sqrt{r_{xx}}\sqrt{1 - r_{xx}}.$

The error made in using the best fitting regression equation 
to predict the true score from the observed score is given by 
equation 24. 

It may seem paradoxical at first to note that $s_e = 0$ if $r_{xx} = 0$. 
However, we see in equation 39 of Chapter 2 that $s_x\sqrt{r_{xx}}$ is the standard 
deviation of the true scores. Thus $s_e$ is always some fraction of the true 
standard deviation. By inspection of the equation $s_t = s_x\sqrt{r_{xx}}$, we see 
that if the reliability is zero, $s_t$ 


6. Comparison of four errors 
The relative magnitude of these four errors is shown in Figure 1. 
The relationship can easily be seen if the expression $1 - r^2$ is factored 
into $(1 - r)(1 + r)$. 

[Figure 1. Comparison of errors of measurement, prediction, and substitution. 
Vertical axis: error ÷ standard deviation. Horizontal axis: r, from .1 to 1.0.] 

The four errors arranged in order from smallest 
to largest may be written as follows: 

$s_e = s_x\sqrt{1 - r_{xx}}\,\sqrt{r_{xx}}$  (error in estimating true score, equation 24), 
$s_e = s_x\sqrt{1 - r_{xx}}\,\sqrt{1}$  (error of measurement, equation 20), 
$s_d = s_x\sqrt{1 - r_{xx}}\,\sqrt{1 + r_{xx}}$  (error in estimating observed score, equation 12), 
$s_d = s_x\sqrt{1 - r_{xx}}\,\sqrt{2}$  (error of substitution, equation 6). 

These terms are written so that they are identical except for the last 
factor. Since the reliability coefficient must be between zero and one, 
we shall always have $r_{xx} < 1 < 1 + r_{xx} < 2$. Thus for any given set 
of data we shall always find that $s_e$ (equation 24) < $s_e$ (equation 20) < $s_d$ (equation 12) < $s_d$ (equation 6). 
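
The ordering of the four errors can be confirmed for any particular test. The Python sketch below is illustrative only: the function name `four_errors`, the dictionary labels, and the sample values (standard deviation 12, reliability 0.75) are hypothetical. It writes each error as the common factor multiplied by its distinguishing last factor, as in the comparison above.

```python
import math


def four_errors(s_x, r_xx):
    """Return the four standard errors, listed from smallest to largest."""
    common = s_x * math.sqrt(1 - r_xx)  # factor shared by all four errors
    return {
        "estimating true score (eq. 24)": common * math.sqrt(r_xx),
        "measurement (eq. 20)": common * math.sqrt(1),
        "estimating observed score (eq. 12)": common * math.sqrt(1 + r_xx),
        "substitution (eq. 6)": common * math.sqrt(2),
    }


if __name__ == "__main__":
    for name, value in four_errors(12.0, 0.75).items():
        print(f"{name}: {value:.3f}")
```

Because only the last factor differs, the order of the four values is the same for every reliability strictly between zero and one.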


7. Summary 
Four different sorts of test error have been considered. Two of them 
are what might be called "errors of substitution"; one is the error 
volved in substituting a score on one test for a score on a parallel test, 
the other is the error involved in substituting the observed score for 
the true score. This latter error is the most commonly used one and is 
termed the error of measurement. 

Corresponding to each of these errors of substitution is an “error of 
estimate.” One is the ordinary error of estimate, which is the error 
made in estimating score on one test from score on a parallel test. The 
other is the error made in estimating the true score from observed score. 
This latter one is almost never used, since no practical advantage is 
gained from using the regression equation to estimate true scores. 


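The equations for these four errors are 

(1)  $d = x_1 - x_2,$
(7)  $d = x_1 - r_{12}x_2,$
(13)  $e = x - t,$
(22)  $e = t - r_{xt}\left(\frac{s_t}{s_x}\right)x,$

while the corresponding standard deviations are 

(6)  $s_d = s_x\sqrt{2(1 - r_{xx})},$
(12)  $s_d = s_x\sqrt{1 - r_{xx}^2},$
(20)  $s_e = s_x\sqrt{1 - r_{xx}},$
(24)  $s_e = s_x\sqrt{r_{xx}(1 - r_{xx})}.$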
Problems 


1. Give each of the four error indices for each of the following five tests. 


Test   Number of Items   Mean   Standard Deviation   Reliability 
A             70           40          10.0              .84 
B             50           35           5.0              .72 
C            400          251          55.7              .93 
D            200          110          25.8              .80 
E            150           60          15.2              .78 


2. Read the Douglass (1934) and Monroe (1934) articles, and write a brief résumé 
and criticism of the material. 


3. Forms M and L of the Stanford Binet are parallel forms. One investigator 
uses form M in a school, and later another investigator uses form L on the same 
group. An enterprising student calculates the score differences to verify the formula 
for the standard error of measurement, $\sigma_x\sqrt{1 - r}$. 


(a) Will he verify this formula? 
(b) What error measure would be verified for score differences? 
(c) If the verification were not precise, what explanation would be reasonable? 


4. One investigator used one form of an arithmetic test (a brief 5-minute form) 
in his investigation. Another investigator used another 2-hour arithmetic test and 
divided the total number solved by 24 to make results comparable with the 5-minute 
form. The standard deviation of the differences between these scores would be 
given by what formula? 


5. Form M of the Binet test has been in constant use in a given clinic. The 
director orders that form L of the Binet test be used in the future. For uniformity 
all old M scores are to be expressed in terms of the new form. 


(a) What method will accomplish this with minimum error? 
(b) What formula gives this error? 


6. Under what condition will the standard error of measurement equal zero? 


7. Under what condition will the standard error of measurement equal the 
standard deviation of true scores? 


8. Under what condition will the standard error of measurement equal the 
standard deviation of the test scores? 
9. Under what condition is the standard error of estimate equal to zero? 


10. Under what conditions is the standard error of estimate equal to the standard 
deviation of the variable being predicted? 

11. Under what condition is the standard error of estimate equal to the standard 
deviation of the variable used for prediction? 

12. Mr. A obtains a score of 117 on test E (problem 1) and Mr. B obtains a score 
of 95 on test E. 

(a) What is the standard error of Mr. A’s test score? 


(b) What is the standard error of Mr. B's test score? 

(c) What upper and lower limits would be assigned to Mr. A's true score at the 
0.3 per cent level? 

(d) What upper and lower limits would be assigned to Mr. B's true score at the 
0.3 per cent level? 


13. Study the material given by Bradford (1940). Comment on his results. 


14. Study the equation for the reliability of a standard score given by Dickey 
(1930), and comment on this concept of error. 


5 


Various Interpretations 
of the Error of Measurement 


1. Error of measurement and error of substitution 

Having in the previous chapter shown the difference between several 
different types of “error,” we shall now consider more intensively the 
"error of measurement." Several alternative derivations or "interpre- 
tations" of this quantity will be given in order to show more clearly 
its properties and its meaning. 

The error of measurement has already been derived as the standard 
deviation of the difference between true and observed score. However, 
the fact that this formula ($s_x\sqrt{1 - r}$) involves the expression $1 - r$ 
instead of $1 - r^2$, as does the error of estimate, usually provokes some 
inquiry such as, "Why don't you use $1 - r^2$ in the error of 
measurement?" It may be in order here to derive the error of measurement in 
some other ways to show the nature of the difference between it and 
the error of estimate. As shown in equation 7 of Chapter 4, the entire 
amount of the error made in predicting is called the "error of estimate." 
Also by inspecting equation 1 of Chapter 4, we see that the entire amount 
of the difference between the score on one test and on the other test is 
charged to the "error." Let us see what happens when this difference 
($x_1 - x_2$) is charged partly to one test and partly to the other. 

If we assume that the error is partly in $x_1$ and partly in $x_2$ and that 
these two errors are uncorrelated, we obtain the following: 

(1)  $e_1 + e_2 = d.$

Squaring and summing, we find that 

(2)  $\sum e_1^2 + \sum e_2^2 + 2\sum e_1e_2 = \sum d^2.$

If we divide by N, 

(3)  $s_{e_1}^2 + s_{e_2}^2 + 2r_{e_1e_2}s_{e_1}s_{e_2} = s_d^2,$

and if we assume that the error variance for one test is equal to that for 
the other test and that errors correlate zero, 

(4)  $2s_e^2 = s_d^2.$

Substituting from equation 5 of Chapter 4, we obtain 

(5)  $s_e^2 = s_1^2(1 - r_{12}),$

which is the square of the error of measurement, as previously derived from the 
definition e = x − t. 

If the difference between two parallel tests is assumed to be 
divided into two equal and uncorrelated parts, the standard 
deviation of each of these parts is given by the usual formula 
for the standard error of measurement. 

The error of estimate, which uses the term $(1 - r^2)$, is a measure of 
the total error made in predicting score on one test from score on another 
test by using the least squares regression line for prediction. 
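
The division of the difference score into two equal and uncorrelated parts can be illustrated by simulation. The Python sketch below is not part of the original argument and rests on assumptions stated here: true scores and errors are drawn from normal distributions (the text requires only zero correlations, not normality), the standard deviations 10 and 3 are arbitrary, and the function name `simulate` is hypothetical. Two parallel scores are built from a common true score plus independent errors; the standard deviation of their difference divided by the square root of 2 is then compared with the standard deviation of one test multiplied by the square root of one minus the correlation between forms.

```python
import math
import random
from statistics import fmean, pstdev


def simulate(n=50_000, sd_true=10.0, sd_error=3.0, seed=0):
    """Return two estimates of the error of measurement from simulated data."""
    rng = random.Random(seed)
    t = [rng.gauss(0, sd_true) for _ in range(n)]
    x1 = [ti + rng.gauss(0, sd_error) for ti in t]  # first parallel form
    x2 = [ti + rng.gauss(0, sd_error) for ti in t]  # second parallel form

    d = [a - b for a, b in zip(x1, x2)]
    m1, m2 = fmean(x1), fmean(x2)
    s1, s2 = pstdev(x1), pstdev(x2)
    r12 = fmean([(a - m1) * (b - m2) for a, b in zip(x1, x2)]) / (s1 * s2)

    from_difference = pstdev(d) / math.sqrt(2)   # s_d divided by sqrt(2)
    from_reliability = s1 * math.sqrt(1 - r12)   # s_1 * sqrt(1 - r_12)
    return from_difference, from_reliability


if __name__ == "__main__":
    print(simulate())
```

Both values should lie close to the error standard deviation built into the simulation (3 in this sketch), apart from sampling fluctuation.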


2. Error of measurement as an error of estimate 


Let us consider another derivation of the error of measurement in 
order to aid in showing "why" we use $(1 - r)$ instead of $(1 - r^2)$. The 
error of measurement will be shown to be the error of estimate obtained 
from the regression of observed score on true score. We may begin 
with the ordinary regression equation, 

(6)  $\hat{x} = r_{xt}\left(\frac{s_x}{s_t}\right)t.$

Obtaining the error of estimate in the usual way, we see that it may be 
written 

(7)  $s_{x \cdot t} = s_x\sqrt{1 - r_{xt}^2}.$

Since $r_{xt} = \sqrt{r_{xx}}$ (see Chapters 2 and 3), we may also write 

(8)  $s_{x \cdot t} = s_x\sqrt{1 - r_{xx}},$

which is the error of measurement. 

The error of estimate derived from the regression of observed 
upon true score is the same as the error of measurement. 
3. Error of measurement as interaction between persons 
and tests 

Those students who are acquainted with elementary analysis of 
variance methods, involving a first-order interaction (see Lindquist 
(1940) or Fisher (1916)), will find that the following material aids in 
understanding the error of measurement. Those unacquainted with 
analysis of variance should omit the remainder of this chapter. For 
additional material on the relation between analysis of variance and the 
error of measurement, see Hoyt (1941), Jackson (1939), (1940b), and 
Kaitz (1945a). 

Let us consider first the case of two tests, designated as x and y. The 
average of the two scores (x + y)/2 is designated as a. We have thus 
the following matrix of scores for N persons: 

$x_1 \quad x_2 \quad x_3 \quad \cdots \quad x_N \quad M_x$   first score 
$y_1 \quad y_2 \quad y_3 \quad \cdots \quad y_N \quad M_y$   second score 
$a_1 \quad a_2 \quad a_3 \quad \cdots \quad a_N$   average of first and second scores. 

With no loss of generality, we can assume that the total mean, and 
therefore the total sum ($\sum x + \sum y$), is zero; then the sum of squares 
due to tests is 

$N(M_x^2 + M_y^2).$

The sum of squares due to persons is 

$2\sum a^2,$

and the total sum of squares is 

$\sum x^2 + \sum y^2.$

Thus the sum of squares due to interaction is 

$\sum x^2 + \sum y^2 - 2\sum a^2 - N(M_x^2 + M_y^2).$

If, in the foregoing expression, we substitute (x + y)/2 for a and 
write $I^2$ for the sum of squares due to interaction, we have 

(9)  $I^2 = \sum x^2 + \sum y^2 - \frac{2\sum (x + y)^2}{4} - N(M_x^2 + M_y^2).$

Expanding the third term gives 

(10)  $I^2 = \sum x^2 + \sum y^2 - \frac{\sum x^2 + \sum y^2 + 2\sum xy}{2} - N(M_x^2 + M_y^2).$

Combining similar terms, and adding and subtracting $NM_xM_y$, we have 

(11)  $I^2 = \frac{\sum x^2}{2} + \frac{\sum y^2}{2} - \sum xy + NM_xM_y - N(M_x^2 + M_y^2) - NM_xM_y.$

By dividing the term in parentheses into two equal parts and writing 
them separately, we have 

(12)  $I^2 = \frac{\sum x^2 - NM_x^2}{2} + \frac{\sum y^2 - NM_y^2}{2} - \left(\sum xy - NM_xM_y\right) - \frac{N}{2}\left(M_x^2 + M_y^2 + 2M_xM_y\right).$

We obtain the variance due to interaction by dividing the sum of squares 
due to interaction by the degrees of freedom (N − 1). Thus we obtain 

(13)  $\frac{I^2}{N - 1} = \frac{\sum x^2 - NM_x^2}{2(N - 1)} + \frac{\sum y^2 - NM_y^2}{2(N - 1)} - \frac{\sum xy - NM_xM_y}{N - 1} - \frac{N}{2(N - 1)}(M_x + M_y)^2.$

Since $M_x = -M_y$, the last term in equation 13 vanishes. The other 
terms are equal to variances and covariances so that we can rewrite the 
equation as follows: 

(14)  $\frac{I^2}{N - 1} = \frac{s_x^2}{2} + \frac{s_y^2}{2} - r_{xy}s_xs_y.$

If $s_x = s_y$, we may put $s_x$ in place of $s_y$ and use $r_{xx}$ instead of $r_{xy}$, obtaining 

(15)  $\frac{I^2}{N - 1} = s_x^2 - r_{xx}s_x^2,$

which is equal to 

(16)  $\frac{I^2}{N - 1} = s_x^2(1 - r_{xx}).$

The variance due to interaction between persons and tests is 
the square of the error of measurement. 
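
Equation 14 can be checked on a small score matrix. In the Python sketch below the ten scores are invented for illustration and the function names are hypothetical. The interaction sum of squares is computed directly from the residuals left after removing person and test effects, divided by N − 1, and compared with the right-hand member of equation 14, in which variances and the covariance are also taken with N − 1 in the denominator. The grand mean is not forced to zero here, since the interaction residuals are unaffected by adding a constant to all scores.

```python
from statistics import fmean


def interaction_mean_square(x, y):
    """Interaction sum of squares for N persons by 2 tests, divided by N - 1."""
    n = len(x)
    mx, my = fmean(x), fmean(y)
    grand = (mx + my) / 2
    ss = 0.0
    for xi, yi in zip(x, y):
        a = (xi + yi) / 2  # person average
        ss += (xi - a - mx + grand) ** 2 + (yi - a - my + grand) ** 2
    return ss / (n - 1)


def equation_14(x, y):
    """Half the sum of the two variances minus the covariance (N - 1 denominators)."""
    n = len(x)
    mx, my = fmean(x), fmean(y)
    var_x = sum((v - mx) ** 2 for v in x) / (n - 1)
    var_y = sum((v - my) ** 2 for v in y) / (n - 1)
    cov = sum((u - mx) * (v - my) for u, v in zip(x, y)) / (n - 1)
    return var_x / 2 + var_y / 2 - cov


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5]  # hypothetical first scores
    y = [2, 1, 4, 3, 5]  # hypothetical second scores
    print(interaction_mean_square(x, y), equation_14(x, y))
```

For these scores the two variances are equal and the correlation is 0.8, so both quantities also agree with equation 16.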


Let us extend the demonstration of the relationship between interaction 
and error variance to the case of K tests, where K may be any 
number. We may begin with the matrix of scores 

$x_{11} \quad x_{12} \quad x_{13} \quad \cdots \quad x_{1K} \quad a_1$
$x_{21} \quad x_{22} \quad x_{23} \quad \cdots \quad x_{2K} \quad a_2$
$x_{31} \quad x_{32} \quad x_{33} \quad \cdots \quad x_{3K} \quad a_3$
$\cdots$
$x_{N1} \quad x_{N2} \quad x_{N3} \quad \cdots \quad x_{NK} \quad a_N$
$M_1 \quad M_2 \quad M_3 \quad \cdots \quad M_K \quad 0.$

Again let us assume that the grand mean is zero since this will simplify 
the expressions without loss of generality. Let us use i and j (varying 
from 1 to N) as the subscripts representing persons, and g and h (varying 
from 1 to K) as the subscripts representing tests. We may then 
write the total sum of squares as 

$\sum_{i=1}^{N}\sum_{g=1}^{K} x_{ig}^2.$

The sum of squares for tests would be 

$N\sum_{g=1}^{K} M_g^2,$

where 

$M_g = \frac{1}{N}\sum_{i=1}^{N} x_{ig}.$

The sum of squares for persons would be 

$K\sum_{i=1}^{N} a_i^2,$

where 

$a_i = \frac{1}{K}\sum_{g=1}^{K} x_{ig}.$

The sum of squares due to interaction may be designated by $I^2$ and 
written as 

(17)  $I^2 = \sum_{g=1}^{K}\sum_{i=1}^{N} x_{ig}^2 - K\sum_{i=1}^{N} a_i^2 - N\sum_{g=1}^{K} M_g^2.$
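Using the definition of $a_i$ given above, we can write
\[ I^2 = \sum_{i=1}^{N}\sum_{g=1}^{K} x_{ig}^2 - \frac{1}{K}\sum_{i=1}^{N}\Bigl[\sum_{g=1}^{K} x_{ig}\Bigr]^2 - N\sum_{g=1}^{K} M_g^2. \tag{18} \]
This equation can be rewritten as
\[ I^2 = \frac{K-1}{K}\sum_{g=1}^{K}\Bigl[\sum_{i=1}^{N} x_{ig}^2\Bigr] - \frac{1}{K}\sum_{g \ne h = 1}^{K^2-K}\Bigl[\sum_{i=1}^{N} x_{ig}x_{ih}\Bigr] - N\sum_{g=1}^{K} M_g^2. \tag{19} \]
The last term may be written in two parts, and $\sum_{g \ne h = 1}^{K^2-K} N M_g M_h / K$ can
be added and subtracted, giving terms that constitute variances and
covariances as follows:
\[
\begin{aligned}
I^2 ={}& \frac{K-1}{K}\sum_{g=1}^{K}\Bigl[\sum_{i=1}^{N} x_{ig}^2 - N M_g^2\Bigr]
 - \frac{1}{K}\sum_{g \ne h = 1}^{K^2-K}\Bigl[\sum_{i=1}^{N} x_{ig}x_{ih} - N M_g M_h\Bigr] \\
 & - \frac{1}{K}\sum_{g=1}^{K} N M_g^2 - \frac{1}{K}\sum_{g \ne h = 1}^{K^2-K} N M_g M_h.
\end{aligned}
\tag{20}
\]
The last two terms can be combined into a squared term:
\[ I^2 = \frac{K-1}{K}\sum_{g=1}^{K}\Bigl[\sum_{i=1}^{N} x_{ig}^2 - N M_g^2\Bigr] - \frac{1}{K}\sum_{g \ne h = 1}^{K^2-K}\Bigl[\sum_{i=1}^{N} x_{ig}x_{ih} - N M_g M_h\Bigr] - \frac{N}{K}\Bigl[\sum_{g=1}^{K} M_g\Bigr]^2. \tag{21} \]
The last term can be omitted since $\sum M_g$ is zero. We thus have the
final expression for sum of squares due to interaction, as
\[ I^2 = \frac{K-1}{K}\sum_{g=1}^{K}\Bigl[\sum_{i=1}^{N} x_{ig}^2 - N M_g^2\Bigr] - \frac{1}{K}\sum_{g \ne h = 1}^{K^2-K}\Bigl[\sum_{i=1}^{N} x_{ig}x_{ih} - N M_g M_h\Bigr]. \tag{22} \]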
Dividing by the appropriate degrees of freedom $(K - 1)(N - 1)$, we
obtain
\[ \frac{I^2}{(K-1)(N-1)} = \frac{1}{K}\sum_{g=1}^{K}\left[\frac{\sum_{i=1}^{N} x_{ig}^2 - N M_g^2}{N-1}\right] - \frac{1}{K(K-1)}\sum_{g \ne h = 1}^{K^2-K}\left[\frac{\sum_{i=1}^{N} x_{ig}x_{ih} - N M_g M_h}{N-1}\right], \tag{23} \]
which can be rewritten as an average variance and covariance.
\[ \frac{I^2}{(K-1)(N-1)} = \frac{1}{K}\sum_{g=1}^{K} s_g^2 - \frac{1}{K^2-K}\sum_{g \ne h = 1}^{K^2-K} r_{gh}s_gs_h. \tag{24} \]
This gives the final expression for variance due to interaction:
\[ \frac{I^2}{(K-1)(N-1)} = \overline{s_g^2} - \overline{r_{gh}s_gs_h}. \tag{25} \]


That is, the variance attributable to interaction between persons and
tests is the average test variance, minus the average intertest covariance.
Since we are dealing with parallel tests we may assume that the variances
are equal and that the intercorrelations are equal, giving
\[ \frac{I^2}{(K-1)(N-1)} = s_g^2(1 - r_{gh}). \tag{26} \]

For the general case of K parallel tests, the variance due to
interaction between persons and tests is the square of the error
of measurement.
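
The result in equation 25 can be checked numerically. The sketch below is an illustration only: the data are simulated under assumed values (200 persons, 4 tests, each score a true score plus independent error), and the function names are hypothetical. It computes the interaction mean square directly from the persons-by-tests table and compares it with the average test variance minus the average intertest covariance, using $N - 1$ in the denominators as in the text. The grand mean is subtracted explicitly, which is equivalent to the text's assumption that it is zero.

```python
import random

def mean(values):
    return sum(values) / len(values)

def covariance(u, v):
    """Covariance with N - 1 in the denominator, as in equations 13 and 23."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

def interaction_mean_square(scores):
    """scores[i][g] is the score of person i on test g."""
    n, k = len(scores), len(scores[0])
    grand = mean([x for row in scores for x in row])
    person_means = [mean(row) for row in scores]
    test_means = [mean([row[g] for row in scores]) for g in range(k)]
    # Sum of squares left after removing person and test effects.
    ss = sum(
        (scores[i][g] - person_means[i] - test_means[g] + grand) ** 2
        for i in range(n) for g in range(k)
    )
    return ss / ((k - 1) * (n - 1))

def average_variance_and_covariance(scores):
    k = len(scores[0])
    cols = [[row[g] for row in scores] for g in range(k)]
    avg_var = mean([covariance(c, c) for c in cols])
    avg_cov = mean([covariance(cols[g], cols[h])
                    for g in range(k) for h in range(k) if g != h])
    return avg_var, avg_cov

random.seed(0)
# Hypothetical data: 200 persons, 4 roughly parallel tests.
scores = []
for _ in range(200):
    true_score = random.gauss(0, 3)
    scores.append([true_score + random.gauss(0, 2) for _ in range(4)])

avg_var, avg_cov = average_variance_and_covariance(scores)
ims = interaction_mean_square(scores)
print(ims, avg_var - avg_cov)  # the two values agree up to rounding error
```

With these assumed parameters the error variance is 4, so both printed values should lie near 4, subject to sampling fluctuation.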


If the error of measurement is small as compared with the standard
deviation of the test, the interaction variance is small (as compared with
the true variance discussed in the next section), and the different tests
are highly correlated. Correspondingly, if the interaction variance is
large, as compared with variance due to persons, the error of measurement
is large, relative to the standard deviation of the test, and the
reliability of the test (the correlation between parallel forms) is low.


4. Relation between true variance and variance due to persons

In considering the relationship between test theory and analysis of
variance, we have shown that interaction between persons and tests is
identical with error of measurement. It is clear that if all the tests
had the same mean, the variance due to tests would be zero. Thus the
variance due to tests simply measures the extent to which the means of
the different tests differ. If this difference is unduly large, the
tests are not “parallel” in the sense of having equal means.

Since in test theory, the “true variance” represents a variance between 
persons, we should expect to find some relationship between true 
variance and “variance due to persons." Referring to the previous 
section, we find an expression for the sum of squares due to persons. 


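Designating this by $P^2$, we may write
\[ P^2 = K\sum_{i=1}^{N} a_i^2, \tag{27} \]
where
\[ a_i = \frac{\sum_{g=1}^{K} x_{ig}}{K}. \tag{28} \]
We may substitute equation 28 in equation 27, obtaining
\[ P^2 = \frac{1}{K}\sum_{i=1}^{N}\Bigl[\sum_{g=1}^{K} x_{ig}\Bigr]^2. \tag{29} \]
The square of a sum may also be written as a double summation, giving
\[ P^2 = \frac{1}{K}\sum_{i=1}^{N}\sum_{g=1}^{K}\sum_{h=1}^{K} x_{ig}x_{ih}. \tag{30} \]
Changing the order of summation, we obtain
\[ P^2 = \frac{1}{K}\sum_{g=1}^{K}\sum_{h=1}^{K}\sum_{i=1}^{N} x_{ig}x_{ih}. \tag{31} \]
Let us now explicitly separate the terms involving a sum of squares
from those involving a cross product,
\[ P^2 = \frac{1}{K}\sum_{g=1}^{K}\sum_{i=1}^{N} x_{ig}^2 + \frac{1}{K}\sum_{g=1}^{K}\sum_{h=1}^{K}\sum_{i=1}^{N} x_{ig}x_{ih} \qquad (g \ne h). \tag{32} \]
We observe that, since all terms where $g = h$ are excluded from the
second expression, it contains only $(K^2 - K)$ terms. In order to have
the upper limit of summation indicate the number of terms, we shall
write equation 32 as follows:
\[ P^2 = \frac{1}{K}\sum_{g=1}^{K}\Bigl[\sum_{i=1}^{N} x_{ig}^2\Bigr] + \frac{1}{K}\sum_{g \ne h = 1}^{K^2-K}\Bigl[\sum_{i=1}^{N} x_{ig}x_{ih}\Bigr]. \tag{33} \]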
Referring to the preceding section, we note that we assumed the mean
of all means to be zero, but did not specify that the mean of each test
was zero. Therefore, in order to obtain deviations from the mean, it
is necessary to rewrite equation 33 as
\[ P^2 = \frac{1}{K}\sum_{g=1}^{K}\Bigl[\sum_{i=1}^{N} x_{ig}^2 - N M_g^2\Bigr] + \frac{1}{K}\sum_{g \ne h = 1}^{K^2-K}\Bigl[\sum_{i=1}^{N} x_{ig}x_{ih} - N M_g M_h\Bigr]. \tag{34} \]
This change, however, does not affect the value of $P^2$ since the terms
added total to zero. We can see this by writing them explicitly:
\[ \sum_{g=1}^{K} N M_g^2 + \sum_{g \ne h = 1}^{K^2-K} N M_g M_h, \tag{35} \]
and then, expressing equation 35 as the square of a sum, we write
\[ N\Bigl[\sum_{g=1}^{K} M_g\Bigr]^2. \tag{36} \]
Since the term in brackets is K times the grand mean, it equals zero, which
shows that the value of equation 33 is equal to the value of equation 34.
Hence we may write
\[ P^2 = \frac{1}{K}\sum_{g=1}^{K}\Bigl[\sum_{i=1}^{N} x_{ig}^2 - N M_g^2\Bigr] + \frac{1}{K}\sum_{g \ne h = 1}^{K^2-K}\Bigl[\sum_{i=1}^{N} x_{ig}x_{ih} - N M_g M_h\Bigr]. \tag{37} \]

Dividing now by the appropriate degrees of freedom $(N - 1)$, we obtain
\[ \frac{P^2}{N-1} = \frac{1}{K}\sum_{g=1}^{K}\left[\frac{\sum_{i=1}^{N} x_{ig}^2 - N M_g^2}{N-1}\right] + \frac{1}{K}\sum_{g \ne h = 1}^{K^2-K}\left[\frac{\sum_{i=1}^{N} x_{ig}x_{ih} - N M_g M_h}{N-1}\right]. \tag{38} \]
This equation may be rewritten explicitly in terms of average variance
and covariance, giving
\[ \frac{P^2}{N-1} = \overline{s_g^2} + (K - 1)\,\overline{r_{gh}s_gs_h}. \tag{39} \]
Equation 39 gives the value of the variance due to persons. It is the
average variance plus $(K - 1)$ times the average covariance.
We can see that, if we divide equation 39 by K, we obtain
\[ \frac{P^2}{K(N-1)} = \frac{\overline{s_g^2}}{K} + \left(1 - \frac{1}{K}\right)\overline{r_{gh}s_gs_h}, \tag{40} \]
and that, if we let K approach infinity, we obtain
\[ \lim_{K \to \infty} \frac{P^2}{K(N-1)} = \overline{r_{gh}s_gs_h}, \tag{41} \]
which is equivalent to true variance as discussed in Chapters 2 and 3.

For K parallel tests, one Kth of the value of the variance due
to persons approaches the true variance as K approaches infinity.
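
Equations 39 to 41 can be illustrated with simulated scores. The following is a sketch under stated assumptions, not part of the original derivation: the true-score and error standard deviations (3 and 2), the number of persons (300), and the function names are all hypothetical. For several values of K it checks that the variance due to persons equals the average variance plus $(K - 1)$ times the average covariance, and prints one Kth of that variance beside the average covariance so the approach described in equation 41 can be seen.

```python
import random

def mean(values):
    return sum(values) / len(values)

def covariance(u, v):
    """Covariance with N - 1 in the denominator, as in equation 38."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

def persons_mean_square(scores):
    """P^2 / (N - 1): K times the sum of squared deviations of the
    person means a_i, divided by N - 1."""
    k = len(scores[0])
    person_means = [mean(row) for row in scores]
    return k * covariance(person_means, person_means)

def average_variance_and_covariance(scores):
    k = len(scores[0])
    cols = [[row[g] for row in scores] for g in range(k)]
    avg_var = mean([covariance(c, c) for c in cols])
    avg_cov = mean([covariance(cols[g], cols[h])
                    for g in range(k) for h in range(k) if g != h])
    return avg_var, avg_cov

random.seed(1)
TRUE_SD, ERROR_SD, N_PERSONS = 3.0, 2.0, 300  # hypothetical values
rows = []
for k in (2, 8, 32):
    scores = []
    for _ in range(N_PERSONS):
        t = random.gauss(0, TRUE_SD)
        scores.append([t + random.gauss(0, ERROR_SD) for _ in range(k)])
    avg_var, avg_cov = average_variance_and_covariance(scores)
    pms = persons_mean_square(scores)
    eq39 = avg_var + (k - 1) * avg_cov       # right side of equation 39
    rows.append((k, pms, eq39, pms / k, avg_cov))

for k, pms, eq39, one_kth, avg_cov in rows:
    print(k, round(pms, 3), round(eq39, 3), round(one_kth, 3), round(avg_cov, 3))
```

In each printed row the second and third columns agree, and the gap between the last two columns shrinks as K grows, since by equation 40 it equals the interaction variance divided by K.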


5. Summary

Several different interpretations of the error of measurement have
been given.

1. The error of measurement is the standard deviation of the differences
between the observed score and the true score.

2. If the difference between score on two parallel tests is regarded as
being made up of two equal and uncorrelated components, the
standard deviation of the distribution of these components is the
error of measurement.

3. The error of measurement is identical with the error of estimate
based on the regression of observed on true scores.

4. The error of measurement is the square root of the variance due
to interaction between persons and tests, provided one assumes a
set of parallel tests.

In addition to these interpretations of error variance, we have also
seen that the true variance is the limit, as K approaches infinity, of
1/Kth of the variance due to persons.


Problems

1. Show that, if $x_1 - x_2$ is regarded as divided into two equal and
uncorrelated parts, the standard deviation of one of these parts is the
error of measurement.

2. Show that the error made in estimating observed score from true score is the
error of measurement.

3. Show that the standard deviation of the difference between observed and true
score is the error of measurement.
The average may be found by summing and dividing by N, obtaining
\[ \frac{\sum_{i=1}^{N} X_{ic}}{N} = \frac{\sum_{i=1}^{N} X_{i1}}{N} + \frac{\sum_{i=1}^{N} X_{i2}}{N}. \tag{2} \]
Since the mean equals the sum of scores divided by N, we have
\[ M_c = M_1 + M_2. \tag{3} \]
Since the two tests are parallel, the mean of test 2 will be equal to the
mean of test 1, and we have
\[ M_c = 2M_1. \tag{4} \]

Doubling the length of a test doubles the mean, provided the
original part and the added part are parallel tests.


3. Effect of doubling a test on the variance of gross scores 


We shall next observe what happens to the variance of a test when 
its length is doubled. Again we begin with the gross score 


\[ X_c = X_1 + X_2. \tag{5} \]
Since the mean of the combined tests equals the sum of the two part
means, we may convert to deviation scores by writing
\[ X_c - M_c = (X_1 - M_1) + (X_2 - M_2), \tag{6} \]
which may be written
\[ x_c = x_1 + x_2. \tag{7} \]
Squaring both sides, summing, and dividing by N, we have
\[ \frac{\sum x_c^2}{N} = \frac{\sum x_1^2}{N} + \frac{\sum x_2^2}{N} + \frac{2\sum x_1x_2}{N}. \tag{8} \]
Expressing this in terms of variances and covariances, we have
\[ s_c^2 = s_1^2 + s_2^2 + 2r_{12}s_1s_2. \tag{9} \]
Since the tests are parallel, $s_1 = s_2$, and we may write
\[ s_c^2 = 2s_1^2(1 + r_{12}). \tag{10} \]
We may take the square root of both sides in order to obtain the
standard deviation. This gives
\[ s_c = s_1\sqrt{2(1 + r_{12})}. \tag{11} \]

Doubling the length of a test increases its standard deviation
as indicated in equation 11, provided that the original part
and the added part are parallel tests.


4. Effect of doubling a test on true variance 

Since the “true score” of a given person is the same on the original 
and on the new part of the test, his true score on the combined tests 
is double the original true score: 
\[ T_c = T_1 + T_2 = 2T_1. \tag{12} \]
Since the mean true score is likewise doubled, we may also write the
same equation in deviation score form as
\[ t_c = 2t_1. \tag{13} \]
Squaring, summing, and dividing by N gives
\[ \frac{\sum t_c^2}{N} = \frac{4\sum t_1^2}{N}, \tag{14} \]
which may also be written
\[ s_{t_c}^2 = 4s_{t_1}^2. \tag{15} \]
Taking the square root of both sides gives
\[ s_{t_c} = 2s_{t_1}. \tag{16} \]

Doubling the length of the test doubles the true standard deviation,
or quadruples the true variance, when the original part
and the added part are parallel tests.


5. Effect of doubling a test on error variance 


From equations 5 and 12, we may write
\[ X_c - T_c = (X_1 - T_1) + (X_2 - T_2). \tag{17} \]
This expression is clearly the “error score” for each of the part tests
and the composite; therefore we may write
\[ e_c = e_1 + e_2. \tag{18} \]
Squaring both sides, summing, and dividing by N gives
\[ \frac{\sum e_c^2}{N} = \frac{\sum e_1^2}{N} + \frac{\sum e_2^2}{N} + \frac{2\sum e_1e_2}{N}, \tag{19} \]
which may also be written
\[ s_{e_c}^2 = s_{e_1}^2 + s_{e_2}^2 + 2r_{e_1e_2}s_{e_1}s_{e_2}. \tag{20} \]
Since by the definition of random error the correlational term vanishes,
and the error of measurement in 1 is equal to that in 2 because the tests
are parallel, we may write
\[ s_{e_c}^2 = 2s_{e_1}^2. \tag{21} \]
Taking the square root of both sides gives
\[ s_{e_c} = s_{e_1}\sqrt{2}. \tag{22} \]

When a test is doubled in length the error variance is doubled
or the error of measurement is multiplied by the square root
of two, if the original part and the added part are parallel
tests.


6. Relation of true, error, and observed variance 


Let us check to see if equation 18 of Chapter 2 (the true variance 
plus error variance equals the observed variance) holds for the double- 
length test. Equations 10, 15, and 21 give the value of the observed, 
true, and error variance for the double-length test. Let us set equation 
10 equal to the sum of equations 15 and 21 and see if this gives an 
identity. We have 


\[ 2s_1^2(1 + r_{12}) = 4s_t^2 + 2s_e^2. \tag{23} \]
As has been shown previously, in equation 38 of Chapter 2, $r_{12}s_1^2 = s_t^2$,
and $s_e^2 = s_1^2(1 - r_{12})$; see equation 41, Chapter 2. Substituting these
values in equation 23, we have
\[ 2s_1^2(1 + r_{12}) = 4r_{12}s_1^2 + 2s_1^2(1 - r_{12}). \tag{24} \]


Since this expression is an identity, the relationship previously estab- 
lished for the single-length test still holds for the double-length test, so 
that the equations developed are not inconsistent among themselves. 


Observed variance is equal to true variance plus error variance 
for the double-length test. 
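
That equation 24 is an identity can be verified by expanding its right-hand side. The lines below are a worked check that uses only quantities already defined in the text, with $s_1$ standing for the standard deviation and $r_{12}$ for the reliability of the unit-length test:

```latex
\begin{align*}
4 r_{12} s_1^2 + 2 s_1^2 (1 - r_{12})
  &= 4 r_{12} s_1^2 + 2 s_1^2 - 2 r_{12} s_1^2 \\
  &= 2 s_1^2 + 2 r_{12} s_1^2 \\
  &= 2 s_1^2 (1 + r_{12}).
\end{align*}
```

The last line is the left-hand side of equation 24, so true variance and error variance of the doubled test do add up to its observed variance.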


7. Effect of doubling the length of a test on its reliability 

(Spearman-Brown formula for double length) 

Since the reliability of a test means its correlation with a parallel
form, we shall assume four unit parallel tests designated by the
subscripts 1, 2, 3, and 4, and then determine the reliability of a test of
double length by obtaining the correlation of 1 + 2 and 3 + 4.
Substituting in the deviation score formula for correlation, we may write
\[ r_{(1+2)(3+4)} = \frac{\sum (x_1 + x_2)(x_3 + x_4)}{\sqrt{\sum (x_1 + x_2)^2}\,\sqrt{\sum (x_3 + x_4)^2}}. \tag{25} \]
This equation can be expanded into
\[ r_{(1+2)(3+4)} = \frac{\sum x_1x_3 + \sum x_1x_4 + \sum x_2x_3 + \sum x_2x_4}{\sqrt{\sum x_1^2 + \sum x_2^2 + 2\sum x_1x_2}\,\sqrt{\sum x_3^2 + \sum x_4^2 + 2\sum x_3x_4}}. \tag{26} \]

Dividing the numerator and the denominator each by N, we can write
the result in terms of variances and covariances. We may also simplify
the denominator by noting that, since we are dealing with parallel
tests, the variance of 1 + 2 will equal that of 3 + 4. Making these
changes gives
\[ r_{(1+2)(3+4)} = \frac{r_{13}s_1s_3 + r_{23}s_2s_3 + r_{14}s_1s_4 + r_{24}s_2s_4}{s_1^2 + s_2^2 + 2r_{12}s_1s_2}. \tag{27} \]
We may write this result in terms of average variance and average
covariance:
\[ r_{(1+2)(3+4)} = \frac{2\,\overline{r_{gh}s_gs_h}}{\overline{s_g^2} + \overline{r_{gh}s_gs_h}}. \tag{28} \]


Since we have parallel tests, the variance of the first test and the
covariance of the first two can be used in place of the average, giving
\[ r_{(1+2)(3+4)} = \frac{2r_{12}s_1s_2}{s_1^2 + r_{12}s_1s_2}. \tag{29} \]
Since the standard deviations are equal we can divide numerator and
denominator by $s_1^2$. Let $r_{cc}$ stand for $r_{(1+2)(3+4)}$, the reliability of the
composite test.
\[ r_{cc} = \frac{2r_{12}}{1 + r_{12}} \qquad \text{(Spearman-Brown formula for double length).} \tag{30} \]

If the length of a test is doubled by adding a parallel form,
the reliability is increased as indicated by equation 30.

This is the conventional Spearman-Brown formula for double length. 
It gives an estimate of the reliability of a test if the test is doubled in 
length. Equation 30 is probably more commonly used than any of 
the other equations in this chapter. Whenever the reliability of a test 
is computed by correlating odd with even items, or some other split-half 
method, the correlation between the halves is substituted (as $r_{12}$) in
equation 30 to give the reliability coefficient for the total test. In this 
manner we have an estimate of the correlation of the test with a parallel 
form. Equation 30 is, of course, not used when reliability is determined 
by correlating two parallel forms of a test. 
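
The split-half procedure just described can be sketched in a few lines. The example below is illustrative only: the item responses are simulated (300 persons, 40 right/wrong items, with a hypothetical logistic relation between ability and success), and the function names are not from the text. It correlates odd-item and even-item half scores and then applies equation 30 to that correlation.

```python
import math
import random

def pearson(u, v):
    """Product-moment correlation of two score lists."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

def spearman_brown_double(r12):
    """Equation 30: reliability of the double-length test."""
    return 2 * r12 / (1 + r12)

def split_half_reliability(item_scores):
    """item_scores[i] is the list of item scores (0 or 1) for person i."""
    odd = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    r_halves = pearson(odd, even)
    return r_halves, spearman_brown_double(r_halves)

random.seed(2)
# Hypothetical data: ability determines the chance of passing each item.
item_scores = []
for _ in range(300):
    ability = random.gauss(0, 1)
    p_correct = 1 / (1 + math.exp(-ability))
    item_scores.append([1 if random.random() < p_correct else 0
                        for _ in range(40)])

r_halves, r_total = split_half_reliability(item_scores)
print(round(r_halves, 3), round(r_total, 3))
```

The second printed value is the estimated reliability of the full 40-item test. It is meaningful only to the extent that the two halves behave as parallel tests, which is the one assumption of the derivation.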

By checking back over the derivation, we note that nothing has been 
assumed except that the tests in question are parallel. More explicitly, 
equality of standard deviations and of intercorrelations has been 
assumed, but nothing else. In other words, it has been assumed that
$s_1 = s_2 = s_3 = s_4$ and that $r_{12} = r_{13} = r_{14} = r_{23} = r_{24} = r_{34}$. If these
two sets of equalities hold, then the Spearman-Brown formula is simply
a computational short cut for figuring the reliability of the double-length
test. By the device of using average variance and average covariance,
as in equation 28, we see that, if the variance $s_1^2$ and the covariance
$r_{12}s_1s_2$ differ only slightly from the average variance and covariance
in equation 28, then equation 30 gives a good approximation to the
reliability that will be found for the doubled test.

The application of formulas for a double-length test may be illus- 
trated by the following example. A 50-item test has a mean of 42.0, a 
standard deviation of 5.3, and a reliability of .85. By the use of equation 
38, Chapter 2, or equation 19, Chapter 3, the true variance of this test 
is 23.88; and from equation 41, Chapter 2, or equation 37, Chapter 3, 
the square of the error of measurement is 4.21. If this test is increased 
to 100 items by adding another 50 items that are parallel to the first set 
of 50, we should find the following statistics for the 100-item test: 


Mean                                     84.0   (from equation 4);
Standard deviation                       10.2   (from equation 11);
Variance of the errors of measurement     8.42  (from equation 21);
True variance                            95.52  (from equation 15);
Reliability                                .92  (from equation 30).
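
The figures in this example can be recomputed directly from equations 4, 11, 15, 21, and 30. The short script below is a sketch for checking the arithmetic; the variable names are chosen here for illustration. It works from unrounded intermediate values, so the doubled error variance and the quadrupled true variance may differ from the printed figures in the second decimal place, since those were evidently obtained from the rounded values 4.21 and 23.88.

```python
import math

# Figures for the 50-item test, taken from the example in the text.
mean_1, sd_1, r_12 = 42.0, 5.3, 0.85

var_1 = sd_1 ** 2
true_var_1 = r_12 * var_1          # true variance (equation 38, Chapter 2)
error_var_1 = var_1 * (1 - r_12)   # error variance (equation 41, Chapter 2)

# Statistics for the 100-item (double-length) test.
mean_c = 2 * mean_1                        # equation 4
sd_c = sd_1 * math.sqrt(2 * (1 + r_12))    # equation 11
true_var_c = 4 * true_var_1                # equation 15
error_var_c = 2 * error_var_1              # equation 21
r_cc = 2 * r_12 / (1 + r_12)               # equation 30

print("unit test:  true variance", round(true_var_1, 2),
      " error variance", round(error_var_1, 2))
print("double test: mean", mean_c, " s.d.", round(sd_c, 1),
      " reliability", round(r_cc, 2))
print("double test: true variance", round(true_var_c, 2),
      " error variance", round(error_var_c, 2))
```

As a consistency check, the true and error variances of the doubled test sum to the square of its standard deviation, in line with equation 24.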


The important thing to emphasize here is that to the extent to which 
we are able to construct parallel tests we do not have a “prophecy” 
formula, but simply a computing formula. If the Spearman-Brown for- 
mula fails to “work” or predicts “inaccurately” in any case, this simply 
means that the correlation used (for example, the correlation $r_{12}$) was
larger or smaller than the average of the intercorrelations $r_{gh}$. There
have been several “empirical studies” of the accuracy of the Spearman-
Brown formula. Strictly speaking, it needs no verification, and cannot
be verified. It is possible, however, to intercorrelate the four halves of
the two parallel tests and to see if the tests are really parallel in the sense
that $r_{12} = r_{13} = r_{14} = \cdots = r_{34}$, and in the sense that the four variances


are equal. If so, the Spearman-Brown formula cannot fail to “work”
except through arithmetical error; and, if these assumptions are not
verified, the Spearman-Brown formula does not apply, since the tests
are not parallel. It is possible to investigate empirically the extent to
which the amount of departure from “strict parallelness” that is usually
found affects the applicability of the Spearman-Brown formula.

The foregoing remarks apply to a test that is primarily a power test.
If a test is a pure speed test, the equations showing the effect of test
length will apply only if the test is administered on a work limit basis
(each person finishes the test and his time is recorded), and no serious
warm-up or fatigue effects result from doubling the number of items and
allowing each person whatever time is needed to just complete the
lengthened test. For the more usual group speed test we have a time-
limit method. In this method a fixed time is allowed, and the number
of items completed is recorded. Ideally, there should be so many items
that only the fastest person would complete the test in the time allowed.
Then doubling the “test length” would mean doubling the number of
items and allowing double the original time. The formulas for the effect
of doubling test length would then apply, if there were no serious
warm-up or fatigue effects during the second half of the longer time
limit. The only general rule that can be given is to point out that the
added portion of the test must be parallel to the old portion; and that
the criterion of whether or not the two parts are parallel is the equality
of means, variances, and errors of measurement for the two parts. To
the extent to which these equalities do not hold, the formulas for the
effect of doubling the length of the test will not apply, and cannot be
expected to apply. A statistical criterion for parallel tests is given in
Chapter 14.


8. Experimental work on the Spearman-Brown formula 

It is interesting to note that the formula for the reliability of a
double-length test (equation 30) and its generalization to K parallel tests,
equation 10 of Chapter 8, were first derived in 1910 in the British 
Journal of Psychology, Volume 3. Spearman presented it on page 290, 
and William Brown presented it in the succeeding article (see page 299 
of the same volume). It is therefore referred to as the Spearman-Brown 
formula. 

The earlier articles on the Spearman-Brown formula were vigorously 
adversely critical; see Holzinger (1923b) and Crum (1923). Holzinger 
concluded that if we lengthened the test beyond five times its original 
length, the Spearman-Brown formula gave an overestimate of the re- 
liability. No mention was made of the decreasing likelihood that the 
first correlation would be equal to the average of all correlations or the
decreasing likelihood that the first variance would equal the average of 
all variances. 

Subsequent studies have led to the conclusion that the results obtained 
from the Spearman-Brown formula are reasonably accurate; see Kelley 
(1924), which is a reply to Crum’s criticism, Kelley (1925), Holzinger 
and Clayton (1925), Ruch, Ackerson, and Jackson (1926), and Wood 
(1926c). Thurstone (1931a) has reviewed and summarized much
of this material on the empirical verification of the Spearman-Brown 
formula. 

Slocombe (1927a, 1927b) reviewed several studies. He pointed out 
that it was assumed that the coefficient substituted in the formula was 
"representative" of the group so that this coefficient must be selected 
with care. 

Dunlap (1933) in discussing the problem of test reliability suggested 
that a tetrad technique should be used to determine whether or not 
split fourths of a test are measuring the same thing. It should be noted 
that a different assumption is made in the actual derivation of the 
Spearman-Brown formula. 

It has also been suggested that the Spearman-Brown formula applies 
to rating scales and judgments and that it may be used in predicting the 
reliability to be expected by increasing the number of judges or raters; 
see Gordon (1924), Furfey (1926), Remmers, Shock, and Kelly (1927), 
and Remmers (1931). In general, this suggestion has been found to be 
correct. By considering the only assumption made in the development 
of the Spearman-Brown formula, we see that, if the variance and reli- 
ability for the first rater are equal to the average variance and inter- 
correlation for all raters used, the Spearman-Brown formula must give 
the correct result. If the formula does not give the correct result, it is 
because the variances from rater to rater or the covariances for various 
pairs of raters differ markedly. In other words, the problem here is not 
“Does the Spearman-Brown formula work?” but is “Do the ratings 
from the different raters satisfy the criterion for parallel tests?” 
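
The point about raters can be made concrete with a small calculation. The sketch below assumes the general Spearman-Brown form for K components, $Kr/[1 + (K - 1)r]$, which the text cites as equation 10 of Chapter 8 but does not derive in this chapter; the six intercorrelations among four raters are invented purely for illustration. It contrasts the estimate obtained from a correlation that equals the average intercorrelation with the estimate obtained from a pair that happens to correlate more highly than the average.

```python
def spearman_brown(r_single, k):
    """General Spearman-Brown form (assumed here; cited in the text as
    equation 10 of Chapter 8): reliability of k pooled parallel ratings."""
    return k * r_single / (1 + (k - 1) * r_single)

# Hypothetical intercorrelations among four raters (illustrative values only).
pair_r = {(1, 2): 0.55, (1, 3): 0.60, (1, 4): 0.50,
          (2, 3): 0.65, (2, 4): 0.45, (3, 4): 0.55}
average_r = sum(pair_r.values()) / len(pair_r)

# Raters 1 and 2 correlate at the average value, so the estimate based on
# that single pair agrees with the estimate based on the average.
estimate_from_first_pair = spearman_brown(pair_r[(1, 2)], 4)
estimate_from_average = spearman_brown(average_r, 4)

# Raters 2 and 3 correlate above the average, so starting from that pair
# gives a higher figure than the average intercorrelation supports.
estimate_from_high_pair = spearman_brown(pair_r[(2, 3)], 4)

print(round(estimate_from_first_pair, 3),
      round(estimate_from_average, 3),
      round(estimate_from_high_pair, 3))
```

Any discrepancy of the third kind reflects the choice of an unrepresentative correlation, not a failure of the formula.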

Lanier (1927) interpreted the work of Gordon (1924) and Kelley 
(1925) to mean that correlations increased as the number of cases 
increased. He showed that this was not true. Thurstone (1928a) points 
out clearly the difference between number of judges making a rating 
and number of cases in a correlation scatter plot.

It has also been suggested that increasing the number of choices in a
scale might increase the reliability according to the Spearman-Brown
formula (see Remmers and Ewart, 1941); and that increasing the
number of alternatives in a multiple-choice test will increase reliability
in the same way (Remmers, Karslake, and Gage, 1940, and Denney and
Remmers, 1940).

Denney and Remmers reported that random elimination of incorrect
alternatives from a five-alternative multiple-choice test resulted in a
reduction of reliability that agreed with the Spearman-Brown formula.
However, we see that there is no reason to expect the formula to work
in this case, since it is meaningless to think of equality of variances or
covariances for alternatives on a multiple-choice test. The logic relating
number of alternatives to test reliability has been given by Carroll
(1945) (see formula 30, page 11). An empirical study of this formula is
being undertaken by Mrs. Plumlee, assistant director of Test Development
for the Educational Testing Service.

Nomographs or tables for the Spearman-Brown formula are given by 
Arnold and Dunlap (1936), Cureton and Dunlap (1930), Dunlap and 
Kurtz (1932), and Edgerton and Toops (1928b). 

In general, therefore, the work on the Spearman-Brown formula 
shows that even when relatively little effort is made to obtain parallel 
tests (or ratings) the formula gives reasonably good results. From the 
derivation it is clear that, if we are dealing with tests or ratings known 


to be parallel, the formula must give correct results. 


9. Summary

For the double-length test it has been shown that:
\[ M_c = 2M_1, \tag{4} \]
\[ s_c = s_1\sqrt{2(1 + r_{12})}, \tag{11} \]
\[ s_{t_c} = 2s_{t_1}, \tag{16} \]
\[ s_{e_c} = s_{e_1}\sqrt{2}, \tag{22} \]
\[ r_{cc} = \frac{2r_{12}}{1 + r_{12}}. \tag{30} \]
For these equations the subscript c is used to denote the value for the
composite or double-length test, the subscript 1 designates the mean or
standard deviation of the unit-length test, and $r_{12}$ is the reliability of
the unit-length test.


Problems

1. Prove that the correlation between true and observed score for a test of double
length is $\sqrt{2r/(1 + r)}$, where r is the reliability of the original test. List the
assumptions used in making this derivation.

2. Prove that each of the other basic relations derived in Chapter 2 (or Chapter 3)
holds for the augmented test.

3.

Test   Number of Items   Mean    Standard Deviation   Reliability
A      250               186.1   30.0                 .96
B       30                20.0    4.5                 .72
C      100                69.3   12.4                 .87
D      200                83.7   22.8                 .91
E       50                37.4    7.4                 .83

Estimate the reliability of each of the foregoing tests if it is doubled in length.

4. Read the last section of the article by Lanier (1927) and comment on this
application of the Spearman-Brown formula. Refer also to Thurstone (1928a).


7 


Effect of Test Length 
on Mean and Variance (General Case) 


1. Effect of test length on mean 

We shall now extend the discussion to consider the general case, the 
effect of test length on mean, variance (true, error and observed), and 
on reliability and index of reliability. 


If we designate the composite score of the ith person by $X_{ic}$, we may
write
\[ X_{ic} = X_{i1} + X_{i2} + \cdots + X_{iK}. \tag{1} \]
Summing and dividing by N to obtain the mean, we have
\[ \frac{\sum X_{ic}}{N} = \frac{\sum X_{i1}}{N} + \frac{\sum X_{i2}}{N} + \cdots + \frac{\sum X_{iK}}{N}. \tag{2} \]
Substituting the mean for the sum of scores divided by N, we have
\[ M_c = M_1 + M_2 + \cdots + M_K. \tag{3} \]
Since all the tests are parallel, the mean of each of the component tests
will be equal. $M_1$ can be substituted for each of the means, giving
\[ M_c = KM_1. \tag{4} \]

Increasing the length of a test K times multiplies the mean
by K, provided that each of the new parts is parallel to the
original.


2. Effect of test length on variance of gross scores

As before, we can begin with the expression for the composite gross
score,
\[ X_c = X_1 + X_2 + \cdots + X_K. \tag{5} \]
Since the mean of the combined tests equals the sum of the part means
(see equation 3), we may convert to deviation scores by writing
\[ X_c - M_c = (X_1 - M_1) + (X_2 - M_2) + \cdots + (X_K - M_K). \tag{6} \]
Using lower case x for a deviation score, we may write
\[ x_c = x_1 + x_2 + \cdots + x_K. \tag{7} \]
In order to obtain the standard deviation, we square both sides, sum,
and divide by N, obtaining
\[ \frac{\sum x_c^2}{N} = \frac{\sum (x_1 + x_2 + \cdots + x_K)^2}{N}. \tag{8} \]
If we expand the numerator of the right-hand side of the equation, it
will equal the sum of all the terms in the following matrix.
\[
\begin{array}{ccccc}
\sum x_1^2 & \sum x_1x_2 & \sum x_1x_3 & \cdots & \sum x_1x_K \\
\sum x_1x_2 & \sum x_2^2 & \sum x_2x_3 & \cdots & \sum x_2x_K \\
\sum x_1x_3 & \sum x_2x_3 & \sum x_3^2 & \cdots & \sum x_3x_K \\
\vdots & \vdots & \vdots & & \vdots \\
\sum x_1x_K & \sum x_2x_K & \sum x_3x_K & \cdots & \sum x_K^2
\end{array}
\]
Dividing through by N, we have the sum of all the terms in the
variance-covariance matrix that can be written
\[
\begin{array}{ccccc}
s_1^2 & r_{12}s_1s_2 & r_{13}s_1s_3 & \cdots & r_{1K}s_1s_K \\
r_{12}s_1s_2 & s_2^2 & r_{23}s_2s_3 & \cdots & r_{2K}s_2s_K \\
r_{13}s_1s_3 & r_{23}s_2s_3 & s_3^2 & \cdots & r_{3K}s_3s_K \\
\vdots & \vdots & \vdots & & \vdots \\
r_{1K}s_1s_K & r_{2K}s_2s_K & r_{3K}s_3s_K & \cdots & s_K^2.
\end{array}
\]
The sum of all these terms is the variance of the composite test composed
of the sum of all the tests from 1 to K.
\[ s_c^2 = \sum_{g=1}^{K} s_g^2 + \sum_{g=1}^{K}\sum_{h=1}^{K} r_{gh}s_gs_h \qquad (g \ne h). \tag{9} \]
The variance of the composite test may be expressed in terms of the
average variance and the average covariance as follows:
\[ s_c^2 = K\,\overline{s_g^2} + K(K - 1)\,\overline{r_{gh}s_gs_h}. \tag{10} \]
If the tests are parallel, the average $s_g^2$ may be replaced by $s_1^2$, and the
average covariance by $r_{12}s_1s_2$. If we factor out the $s_1^2K$, we have
\[ s_c^2 = s_1^2K[1 + (K - 1)r_{12}]. \tag{11} \]

Lengthening a test K times increases the variance, as indicated
in equation 11.

Taking the square root of both sides and writing $r_{11}$ for the reliability
coefficient, we have
\[ s_c = s_1\sqrt{K + K(K - 1)r_{11}}, \tag{12} \]
where $s_1$ is the standard deviation of the unit length test,
$r_{11}$ is the reliability of the unit length test,
K is the ratio of the number of items in the new test to the number
in the unit length test, and
$s_c$ is the standard deviation of the lengthened test.

Multiplying the length of a test by K increases the standard
deviation, as indicated in equation 12.
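
Equations 10 and 12 can be compared on simulated data. The sketch below is illustrative and rests on assumed values (400 persons, K = 5 unit tests, true-score standard deviation 4, error standard deviation 3); none of these numbers comes from the text. Equation 10 is an algebraic identity and should reproduce the composite variance exactly, whereas equation 12 substitutes the statistics of the first test for the averages and so agrees only as closely as the tests are parallel.

```python
import math
import random

def mean(values):
    return sum(values) / len(values)

def covariance(u, v):
    """Covariance with N in the denominator, as in equation 8."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

random.seed(3)
K = 5
# Hypothetical scores: 400 persons on K roughly parallel unit tests.
scores = []
for _ in range(400):
    true_score = random.gauss(20, 4)
    scores.append([true_score + random.gauss(0, 3) for _ in range(K)])

tests = [[row[g] for row in scores] for g in range(K)]
composite = [sum(row) for row in scores]

avg_var = mean([covariance(t, t) for t in tests])
avg_cov = mean([covariance(tests[g], tests[h])
                for g in range(K) for h in range(K) if g != h])

var_composite = covariance(composite, composite)
var_eq10 = K * avg_var + K * (K - 1) * avg_cov   # equation 10

# Equation 12 uses test 1 alone; r_11 is taken here as the correlation
# between tests 1 and 2, treated as parallel forms.
s1 = math.sqrt(covariance(tests[0], tests[0]))
s2 = math.sqrt(covariance(tests[1], tests[1]))
r11 = covariance(tests[0], tests[1]) / (s1 * s2)
sd_eq12 = s1 * math.sqrt(K + K * (K - 1) * r11)

print(math.sqrt(var_composite), math.sqrt(var_eq10), sd_eq12)
```

The first two printed values coincide; the third differs slightly because the first test's variance and the first correlation are not exactly equal to the averages.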


3. Effect of test length on true variance 
Since the “true score” of a given person is the same on each of the
part tests, we may write the true score on the composite test as
\[ T_c = KT_1. \tag{13} \]
Since the mean true score is likewise multiplied by K, we may write
this same equation in deviation form,
\[ t_c = Kt_1. \tag{14} \]
Squaring, summing, and dividing by N, we obtain
\[ \frac{\sum_{i=1}^{N} t_{ic}^2}{N} = \frac{K^2\sum_{i=1}^{N} t_{i1}^2}{N}, \tag{15} \]
which is equivalent to
\[ s_{t_c}^2 = K^2s_{t_1}^2. \tag{16} \]
Taking the square root of both sides, we have
\[ s_{t_c} = Ks_{t_1}. \tag{17} \]

Multiplying the length of a test by K multiplies the true
standard deviation by K.
4. Effect of test length on error variance

From the equations developed in the preceding sections (see equations
5 and 13), we may write
\[ X_c - T_c = (X_1 - T_1) + (X_2 - T_2) + \cdots + (X_K - T_K). \tag{18} \]
We may use e to represent the error score and write
\[ e_c = e_1 + e_2 + e_3 + \cdots + e_K. \tag{19} \]
Squaring both sides, summing, and dividing by N, we obtain
\[ \frac{\sum e_c^2}{N} = \frac{\sum (e_1 + e_2 + e_3 + \cdots + e_K)^2}{N}. \tag{20} \]
This expression may be set up as the sum of all the terms in the following
matrix:
\[
\begin{array}{ccccc}
\sum e_1^2 & \sum e_1e_2 & \sum e_1e_3 & \cdots & \sum e_1e_K \\
\sum e_1e_2 & \sum e_2^2 & \sum e_2e_3 & \cdots & \sum e_2e_K \\
\sum e_1e_3 & \sum e_2e_3 & \sum e_3^2 & \cdots & \sum e_3e_K \\
\vdots & \vdots & \vdots & & \vdots \\
\sum e_1e_K & \sum e_2e_K & \sum e_3e_K & \cdots & \sum e_K^2.
\end{array}
\]
Dividing by N, we have
\[
\begin{aligned}
s_{e_c}^2 ={}& s_{e_1}^2 + s_{e_2}^2 + \cdots + s_{e_K}^2 + r_{e_1e_2}s_{e_1}s_{e_2} \\
 & + r_{e_1e_3}s_{e_1}s_{e_3} + \cdots + r_{e_Ke_{K-1}}s_{e_K}s_{e_{K-1}}.
\end{aligned}
\tag{21}
\]
Since, by the definition of random error, the correlational terms vanish,
we can write
\[ s_{e_c}^2 = \sum_{g=1}^{K} s_{e_g}^2. \tag{22} \]
This may also be written in terms of the average error variance as
follows:
\[ s_{e_c}^2 = K\,\overline{s_{e_g}^2}. \tag{23} \]
If we assume that the error variance of the first test is equal to the
average error variance, we may write
\[ s_{e_c}^2 = Ks_{e_1}^2. \tag{24} \]
Taking the square root of both sides gives
\[ s_{e_c} = s_{e_1}\sqrt{K}. \tag{25} \]

Multiplying the length of a test by K multiplies the error variance
by K or the error of measurement by the square root of K.


5. Summary

For the general case of lengthening a test to K times its original
length, the effect on the mean and the different variances has been
shown. We have
\[ M_c = KM_1, \tag{4} \]
\[ s_c = s_1\sqrt{K + K(K - 1)r_{11}}, \tag{12} \]
\[ s_{t_c} = Ks_{t_1}, \tag{17} \]
\[ s_{e_c} = s_{e_1}\sqrt{K}. \tag{25} \]
As in Chapter 6, the subscript c is used to designate the composite
score. In this case, however, it is the composite score formed by adding
scores on K parallel forms. The subscript 1 is used to designate a mean
or standard deviation of the original unit test, and $r_{11}$ is the reliability
of the original unit test.
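
The four summary equations can be gathered into one small function. This is a sketch for illustration, not a procedure given in the text: the function name and the dictionary keys are hypothetical, and the true and error variances of the unit test are obtained from the relations quoted in Chapter 6 (true variance equal to reliability times observed variance, error variance equal to observed variance times one minus reliability). The figures passed in are those of the 50-item example of Chapter 6.

```python
import math

def lengthened_test(mean_1, sd_1, r_11, k):
    """Statistics of a test lengthened to k times its original length,
    assuming every added part is parallel to the original."""
    var_1 = sd_1 ** 2
    true_sd_1 = math.sqrt(r_11 * var_1)          # true s.d. of the unit test
    error_sd_1 = math.sqrt(var_1 * (1 - r_11))   # error of measurement of the unit test
    return {
        "mean": k * mean_1,                                  # equation 4
        "sd": sd_1 * math.sqrt(k + k * (k - 1) * r_11),      # equation 12
        "true_sd": k * true_sd_1,                            # equation 17
        "error_sd": error_sd_1 * math.sqrt(k),               # equation 25
    }

doubled = lengthened_test(42.0, 5.3, 0.85, 2)
tripled = lengthened_test(42.0, 5.3, 0.85, 3)

for label, result in (("K = 2", doubled), ("K = 3", tripled)):
    print(label, {name: round(value, 2) for name, value in result.items()})
```

For K = 2 the function reproduces the double-length results of Chapter 6, and for any K the squared true and error standard deviations add up to the squared observed standard deviation.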


Problems

1. Prove that the observed variance of an augmented test is equal to the sum of its
true and error variance.

2. Prove for the other basic relationships established in Chapter 2 and Chapter 3
that they still hold for the test augmented K times.

3.

Test   Mean   Standard Deviation   Number of Items   Reliability   Number of Subjects
A      73.2   12.7                 120               .02           300
B      17.3    3.8                  25               .86           250
C      21.3   [illegible]           50               .80           430
D      29.3    7.9                  75               .84           150
E      56.5   13.7                 100               .89           200

(a) Estimate the variance of test A if it is increased to 240 items.
(b) Estimate the true variance of test B if 75 items are added.
(c) What is the error of measurement for test C?
(d) What will the error of measurement be if test C is lengthened to 150 items?
(e) How many items would need to be added to test D to double the true variance?
(f) How many items would need to be added to test E to double the error of
measurement?
(g) You would like to increase the standard deviation of test B to 7.6. How many
items would it be necessary to add to the test?
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Effect of Test Length 
on Reliability (General Case) 


1. Introduction 

The equation for the relationship between test length and test reli- 
ability will be developed from the usual formula for the correlation 
between any two sums. No assumptions will be used in the derivation 
until the last step. There it will be assumed that the variance of the 
unit test can be taken as a fair approximation to the average variance 
of all the unit parallel forms, and the reliability of the unit test times 
its variance can be taken as a reasonable approximation to the average 
covariance among all the unit parallel forms. 


2. The correlation between any two sums 


First let us write the formula for the correlation between any two
sums. One series will be designated by the subscripts 1, 2, ..., K, and
the other by the subscripts I, II, ..., L. From the usual formula for
correlation, we have then


(1)
$$r_{(x_1 + x_2 + \cdots + x_K)(x_I + x_{II} + \cdots + x_L)} = \frac{\sum (x_1 + x_2 + \cdots + x_K)(x_I + x_{II} + \cdots + x_L)}{\sqrt{\sum (x_1 + x_2 + \cdots + x_K)^2}\;\sqrt{\sum (x_I + x_{II} + \cdots + x_L)^2}}.$$


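The terms involved in the expansion of the numerator can be
systematically set down in the following rectangular matrix:


$$\begin{array}{lllll}
\sum x_1 x_I & \sum x_1 x_{II} & \sum x_1 x_{III} & \cdots & \sum x_1 x_L \\
\sum x_2 x_I & \sum x_2 x_{II} & \sum x_2 x_{III} & \cdots & \sum x_2 x_L \\
\sum x_3 x_I & \sum x_3 x_{II} & \sum x_3 x_{III} & \cdots & \sum x_3 x_L \\
\vdots & \vdots & \vdots & & \vdots \\
\sum x_K x_I & \sum x_K x_{II} & \sum x_K x_{III} & \cdots & \sum x_K x_L.
\end{array}$$

First, $x_1$ is multiplied by $x_I, x_{II}, \ldots, x_L$; the same is done for $x_2$, giving
the second row; and so on to $x_K$. The sum of all these LK terms is
equal to

$$\sum (x_1 + x_2 + \cdots + x_K)(x_I + x_{II} + \cdots + x_L).$$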


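Each of the terms in this matrix may also be written as a covariance,
giving


$$\begin{array}{llll}
N r_{1I} s_1 s_I & N r_{1\,II} s_1 s_{II} & \cdots & N r_{1L} s_1 s_L \\
N r_{2I} s_2 s_I & N r_{2\,II} s_2 s_{II} & \cdots & N r_{2L} s_2 s_L \\
\vdots & \vdots & & \vdots \\
N r_{KI} s_K s_I & N r_{K\,II} s_K s_{II} & \cdots & N r_{KL} s_K s_L.
\end{array}$$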


Using g and G as the general subscripts designating tests, where g varies
from 1 to K and G varies from I to L, we can write the sum of all the
terms in the foregoing matrix as follows:


$$\sum_{g=1}^{K} \sum_{G=I}^{L} N r_{gG} s_g s_G.$$

Since the N is a constant for all terms, we may take it outside, writing
the following equation for the numerator term of equation 1.


(2)
$$\sum (x_1 + x_2 + \cdots + x_K)(x_I + x_{II} + \cdots + x_L) = N \sum_{g=1}^{K} \sum_{G=I}^{L} r_{gG} s_g s_G.$$


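We also follow the same procedure for the denominator. The terms
in the expansion of

$$\sum (x_1 + x_2 + \cdots + x_K)^2$$

may be set down in a square matrix as follows:


$$\begin{array}{lllll}
\sum x_1^2 & \sum x_1 x_2 & \sum x_1 x_3 & \cdots & \sum x_1 x_K \\
\sum x_1 x_2 & \sum x_2^2 & \sum x_2 x_3 & \cdots & \sum x_2 x_K \\
\sum x_1 x_3 & \sum x_2 x_3 & \sum x_3^2 & \cdots & \sum x_3 x_K \\
\vdots & \vdots & \vdots & & \vdots \\
\sum x_1 x_K & \sum x_2 x_K & \sum x_3 x_K & \cdots & \sum x_K^2.
\end{array}$$

This matrix may also be rewritten in terms of variances and covariances
as follows:


$$\begin{array}{lllll}
N s_1^2 & N r_{12} s_1 s_2 & N r_{13} s_1 s_3 & \cdots & N r_{1K} s_1 s_K \\
N r_{12} s_1 s_2 & N s_2^2 & N r_{23} s_2 s_3 & \cdots & N r_{2K} s_2 s_K \\
N r_{13} s_1 s_3 & N r_{23} s_2 s_3 & N s_3^2 & \cdots & N r_{3K} s_3 s_K \\
\vdots & \vdots & \vdots & & \vdots \\
N r_{1K} s_1 s_K & N r_{2K} s_2 s_K & N r_{3K} s_3 s_K & \cdots & N s_K^2.
\end{array}$$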


Again the sum of all the terms in this matrix may be written in a
double subscript notation by using the subscripts g and h to designate
the tests, having the limits for both g and h be from 1 to K. We may
thus write


(3)
$$\sum (x_1 + x_2 + x_3 + \cdots + x_K)^2 = N \sum_{g=1}^{K} \sum_{h=1}^{K} r_{gh} s_g s_h \quad (\text{where } r_{gg} = 1).$$


However, it will be noticed that the terms along the principal diagonal
of the matrix are variances, while the non-diagonal terms are covariances.
It is sometimes better to use a notation that keeps the variances and
covariances separate and to write


(4)
$$\sum (x_1 + x_2 + x_3 + \cdots + x_K)^2 = N \sum_{g=1}^{K} s_g^2 + N \sum_{g=1}^{K} \sum_{h=1}^{K} r_{gh} s_g s_h \quad (g \neq h).$$


It is necessary to specify that "g does not equal h" since the cases
where g does equal h have already been included in the sum of the
variances.

By symmetry a term corresponding to equation 4 can be written for
the second factor in the denominator of equation 1. For this case where
the limits are I to L, let us substitute I for 1 and L for K in equation 4.
Also let us change subscripts, using G instead of g and H instead of h.
Making these substitutions, we have


(5)
$$\sum (x_I + x_{II} + x_{III} + \cdots + x_L)^2 = N \sum_{G=I}^{L} s_G^2 + N \sum_{G=I}^{L} \sum_{H=I}^{L} r_{GH} s_G s_H \quad (G \neq H).$$
Using the double subscript notation indicated in equations 2 to 5,
we can write the formula for the intercorrelation of any two sums in
terms of the standard deviations and intercorrelations of the unit tests.
Substituting equations 2, 4, and 5 in equation 1, and dividing the
numerator and denominator by N and using $R_{KL}$ for
$r_{(x_1 + \cdots + x_K)(x_I + \cdots + x_L)}$ in equation 1, we have


(6)
$$R_{KL} = \frac{\displaystyle\sum_{g=1}^{K} \sum_{G=I}^{L} r_{gG} s_g s_G}{\sqrt{\displaystyle\sum_{g=1}^{K} s_g^2 + \sum_{g=1}^{K} \sum_{\substack{h=1 \\ (g \neq h)}}^{K} r_{gh} s_g s_h}\;\sqrt{\displaystyle\sum_{G=I}^{L} s_G^2 + \sum_{G=I}^{L} \sum_{\substack{H=I \\ (G \neq H)}}^{L} r_{GH} s_G s_H}}.$$


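This formula is found in Spearman (1913) and Kelley (1923c). We may
also write the foregoing in terms of the average variance and the average
covariance of the unit tests by substituting the number of terms times
the average for each sum (KL terms in the numerator, K variances and
K(K - 1) covariances under the first radical, L variances and L(L - 1)
covariances under the second), and denoting the average by a bar above
the term.


(7)
$$R_{KL} = \frac{KL\,\overline{r_{gG} s_g s_G}}{\sqrt{K\,\overline{s_g^2} + K(K - 1)\,\overline{r_{gh} s_g s_h}}\;\sqrt{L\,\overline{s_G^2} + L(L - 1)\,\overline{r_{GH} s_G s_H}}}.$$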


It should be noted that this equation is general and precise. It
involves no assumptions whatever. It is based only on simple algebraic
transformations, and any numerical check must verify it, barring
arithmetical errors.

Equations 6 and 7 give the correlation of any two sums in
terms of variances and covariances for the unit tests. These
two equations form the basis for the derivations in this and
the succeeding chapter.
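
Since equation 6 is an algebraic identity, it can be checked numerically on any set of scores. The sketch below is a hedged illustration rather than part of the original derivation: the simulated scores, the sample size, and every function name are assumptions chosen for the example. It builds K = 3 and L = 2 unit tests, correlates the two sums directly, and compares the result with the value obtained from the unit-test standard deviations and intercorrelations alone.

```python
import math
import random

def mean(values):
    return sum(values) / len(values)

def pearson(x, y):
    """Ordinary product-moment correlation of two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def sd(x):
    """Standard deviation with N in the denominator."""
    mx = mean(x)
    return math.sqrt(sum((a - mx) ** 2 for a in x) / len(x))

def cov_term(x, y):
    """The product r_xy * s_x * s_y used throughout equation 6."""
    return pearson(x, y) * sd(x) * sd(y)

def correlation_of_sums(first, second):
    """Equation 6: correlation of the sum of `first` with the sum of `second`."""
    numerator = sum(cov_term(g, G) for g in first for G in second)

    def total_variance(tests):
        variances = sum(sd(t) ** 2 for t in tests)
        covariances = sum(cov_term(g, h)
                          for i, g in enumerate(tests)
                          for j, h in enumerate(tests) if i != j)
        return variances + covariances

    return numerator / math.sqrt(total_variance(first) * total_variance(second))

# Simulated scores (illustrative only): a common ability plus independent noise.
random.seed(0)
ability = [random.gauss(0, 1) for _ in range(200)]

def make_test():
    return [a + random.gauss(0, 1) for a in ability]

first = [make_test() for _ in range(3)]    # K = 3 unit tests
second = [make_test() for _ in range(2)]   # L = 2 unit tests

direct = pearson([sum(s) for s in zip(*first)], [sum(s) for s in zip(*second)])
from_eq6 = correlation_of_sums(first, second)
```

The two values `direct` and `from_eq6` agree to within rounding error, whatever scores are used, because no assumption about parallel tests enters at this stage.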


3. Effect of test length on reliability (Spearman-Brown formula) 

However, in the usual case of trying to estimate the reliability of an
augmented test, the average of the variances of the parallel unit tests
that might be constructed is not available. Likewise the average
covariance among these unit tests is not available. The only figure
available is the variance of the first unit test and the reliability of this
test. If we are willing to assume that the variance of the first unit
test is a fair approximation to the average variance of all unit tests,
and that the reliability coefficient times this variance is a fair
approximation to the average covariance, we shall have some values to substitute
in this formula. It should be noted that, unless we do this, the formula
cannot be used at all. It should also be emphasized that, if the new unit
tests have an average variance that is different from the variance of
the first test, or an average covariance different from $r_{11}s_1^2$, the new
tests are not parallel forms of the original unit test. In other words,
if the new unit tests are parallel to the original one, the assumption is
valid and the formula will hold. If the formula does not hold, the new
tests are not parallel forms of the original one.

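Making the substitution indicated in the foregoing paragraph, we
shall set


$$s_1^2 = \overline{s_g^2} = \overline{s_G^2}$$

and

$$r_{11} s_1^2 = \overline{r_{gh} s_g s_h} = \overline{r_{GH} s_G s_H} = \overline{r_{gG} s_g s_G}.$$


We may write the generalized formula giving the effect of increasing
test length on reliability as follows:


(8)
$$R_{KL} = \frac{KL\, r_{11} s_1^2}{\sqrt{K s_1^2 + K(K - 1) r_{11} s_1^2}\;\sqrt{L s_1^2 + L(L - 1) r_{11} s_1^2}}.$$
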
This general formula may be simplified in several respects. For
reliability, a test is assumed to be correlated with another form of the same
length so that K = L. This means that the two expressions under the
radicals in the denominator are identical so that the product of two
square roots may be indicated by simply writing one of the expressions
without the radical sign. These changes give


(9)
$$R_{KK} = \frac{K^2 r_{11} s_1^2}{K s_1^2 + K(K - 1) r_{11} s_1^2}.$$


Dividing numerator and denominator by $K s_1^2$, we have the final
formula which is


(10)
$$R_{KK} = \frac{K r_{11}}{1 + (K - 1) r_{11}} \quad \text{(general Spearman-Brown formula)},$$

where $r_{11}$ is the reliability of the unit test,
K is the number of items in the lengthened test divided by the
number of items in the unit test, and
$R_{KK}$ is the reliability of the lengthened test.


Making a test K times as long increases the reliability as
indicated in equation 10.


This is the generalized Spearman-Brown formula showing the
relationship between test length (K) and reliability. As mentioned
previously, derivations of this equation were published simultaneously by
Spearman (1910) and Brown (1910). In view of the controversy waged
around this equation, it should be emphasized again that no assumptions
were made in deriving it, except that the variance and covariance figures
obtained for the first unit test could be used in place of the average
variance and the average covariance among the unit parallel tests.
These assumptions are part of the definition of "parallel tests."
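
Equation 10 is easily put into a form suitable for routine computation. The following Python function is a minimal sketch; the name `spearman_brown` is an assumption made for illustration, and the sample reliability of .80 is used only as an example.

```python
def spearman_brown(r_11, k):
    """Reliability of a test k times as long as a unit test of
    reliability r_11 (equation 10). Assumes the added material is
    parallel to the original unit test."""
    return k * r_11 / (1.0 + (k - 1.0) * r_11)

doubled = spearman_brown(0.80, 2)   # 1.60 / 1.80
trebled = spearman_brown(0.80, 3)   # 2.40 / 2.60
```

For K = 1 the function returns the original reliability, and for unit reliabilities of zero or unity the result is unchanged by any value of K.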


4. Graphing the relationships shown by the Spearman-Brown
formula

By regarding R and r as the variables in equation 10 and K as a
parameter, we can show that this is the equation of a rectangular hyperbola.
In order to show this, let us first subtract each side of the equation
from 1 + 1/(K - 1), giving


(11)
$$1 + \frac{1}{K - 1} - R = 1 + \frac{1}{K - 1} - \frac{Kr}{1 + (K - 1)r}.$$

We may rearrange terms in the left-hand member and simplify the
first two terms in the right-hand member of equation 11 to give


(12)
$$1 - R + \frac{1}{K - 1} = \frac{K}{K - 1} - \frac{Kr}{1 + (K - 1)r}.$$


Putting the terms in the right member over a common denominator and
simplifying gives


(13)
$$1 - R + \frac{1}{K - 1} = \frac{K}{(K - 1)[1 + (K - 1)r]}.$$

We may write the denominator of the right-hand term as
$(K - 1)^2[r + 1/(K - 1)]$, and then multiply both sides by $r + 1/(K - 1)$,
obtaining the usual form for the rectangular hyperbola (xy = c).


(14)
$$\left(1 - R + \frac{1}{K - 1}\right)\left(r + \frac{1}{K - 1}\right) = \frac{K}{(K - 1)^2}.$$

This equation has been graphed in Figure 1. This figure shows the
relationship between R and r for various values of K. It can be seen
that for r-values of zero and unity, the value of R is the same as r for
all values of K. For other values of r, R increases as K increases. It
can be seen that these hyperbolas have a horizontal asymptote equal to
1 + 1/(K - 1). That is to say, as r approaches infinity, R approaches
a horizontal line 1/(K - 1) units above one. Also as R approaches
minus infinity, r approaches negative 1/(K - 1). Since r and R
designate reliability coefficients, values outside the range zero to one are
meaningless, and are not shown in the graph.

Figure 1. Diagram for equation 14. Shows the augmented reliability (R) as a
function of the original reliability (r) for various changes in test length (K).


Equation 10 can also be graphed by regarding R and K as variables
and r as a parameter. Such a graph will show how the reliability of
the test increases as test length increases. Dividing the numerator and
denominator of the right-hand side of equation 10 by r gives


(15)
$$R = \frac{K}{K + \dfrac{1 - r}{r}}.$$

Subtracting each side from 1 and simplifying gives


(16)
$$1 - R = \frac{\dfrac{1 - r}{r}}{K + \dfrac{1 - r}{r}},$$

which can be converted readily into the form of the rectangular
hyperbola


(17)
$$(1 - R)\left(K + \frac{1 - r}{r}\right) = \frac{1 - r}{r}.$$

It can be seen that as K approaches infinity, R approaches unity for
all values of r. Also as R approaches negative infinity, K approaches
the vertical asymptote $-(1 - r)/r$. This equation is graphed in Figure 2.
However, since negative test length and negative reliability coefficients
have no meaning, this part of the graph is omitted.


Figure 2. Diagram for equation 17. Shows the augmented reliability (R) as a
function of the increase in test length (K) for different initial reliabilities (r).

For the purposes of preparing a computing diagram for equation 10,
both equations 14 and 17 have the disadvantage of being composed of
curved lines. This necessitates the computation of a great many points
for each line and the use of an arbitrary smoothed curve for the
intermediate points. If equation 10 can be changed into a straight-line form,
it is necessary to have only two points for each line, which means that a
practical computing diagram can be readily constructed by anyone
who has occasion to compute a large number of values using equation 10.

If we take equation 10, take the reciprocal of both sides, divide out the
right-hand side by Kr, and then subtract unity from each side, we have


(18)
$$\frac{1}{R} - 1 = \frac{1}{K}\left(\frac{1}{r} - 1\right).$$

If we regard the left side as the ordinate, the expression in parentheses
as the abscissa, and then give K the values 2, 3, ..., etc., we get the
diagram shown in Figure 3. It should be noted that the measured
distances on the ordinate and abscissa are proportional to (1/R) - 1
and (1/r) - 1, respectively, but the numbers recorded along the ordinate
and abscissa are R and r, respectively. This graph shows the reliability
obtained by increasing the length of the test, when the reliability of
the unit test is .5 or greater.


Figure 3. Diagram for equation 18. Gives a linear computing diagram for
equation 10, the generalized Spearman-Brown formula,
$\frac{1}{R} - 1 = \frac{1}{K}\left(\frac{1}{r} - 1\right)$.

5. Length of test necessary to attain a desired reliability 
Equation 10 gives the reliability of the lengthened test as a function
of the reliability of the unit test, and the length of the new test.
However, the same equation also shows how long a test needs to be made
in order to have a specified reliability. For example, on a trial run a
given test has a reliability of .80, and we wish to construct a test with
a reliability of .90 or slightly larger. If the increased reliability must
be obtained solely by lengthening the test, is it necessary to double,
treble, or quadruple the length of the test? We can answer this question
by putting .80 for r, .90 for R, and solving for the one remaining
unknown, K, in equation 10. If we follow this procedure, we find that the
test must be slightly more than doubled to get a reliability of .90,
whereas trebling the length of the test would give a reliability between
.92 and .93.

Equation 10 can be changed to show the test length needed for any
given reliability. An explicit solution for K can be obtained most
readily from equation 18. If we divide through by the left-hand side
and multiply both sides by K, we have


(19)
$$K = \frac{\dfrac{1}{r} - 1}{\dfrac{1}{R} - 1} = \frac{(1 - r)R}{(1 - R)r}.$$

The length of test needed for any specified reliability can be read from
any of the graphs. Probably Figure 2 is best for this purpose since K
appears here as one of the variables.


In order to increase the reliability of a test from r to R, the
number of items should be multiplied by $\dfrac{(1 - r)R}{(1 - R)r}$.
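
The example of this section, raising a reliability of .80 to .90, can be worked directly from the explicit solution for K. The function below is a hedged sketch; its name and its range check are assumptions added for illustration.

```python
def length_factor(r, R):
    """Factor K by which a test of reliability r must be lengthened
    to reach reliability R (equation 19)."""
    if not (0.0 < r < 1.0 and 0.0 < R < 1.0):
        raise ValueError("reliabilities must lie strictly between 0 and 1")
    return R * (1.0 - r) / (r * (1.0 - R))

k = length_factor(0.80, 0.90)   # 0.18 / 0.08 = 2.25
```

The result, 2.25, agrees with the statement that the test must be slightly more than doubled. A value of K below one indicates that the desired reliability is lower than the present one, so that the test could be shortened.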


6. A function of test reliability that is invariant with respect to
changes in test length

When we compare reliabilities for tests of different lengths, it is
important to state clearly the precise question to be answered. We may
ask: "Is this 20-item test (just as it stands) more or less reliable than
this 100-item test?" In order to answer this question, the reliability
coefficients of each of the tests should be compared. If the 20-item test
has a reliability of .81, and the 100-item test a reliability of .87, the
longer test is the more reliable test. However, we may wish to ask:
"If the 20-item test could be lengthened to 100 items, how would it
then probably compare with the 100-item test?" It should be carefully
noted that this question implies that the test constructor can get 80
other items comparable with the 20 now in the test, and that the students
can answer 100 items of this type with essentially the same sort of
performance they are able to give for 20 items. Substituting K = 5
and r = .81 in the equation

$$R = \frac{Kr}{1 + (K - 1)r},$$

we have

$$R = \frac{4.05}{4.24} = .96.$$

That is to say, if the 20-item test can be lengthened to 100 items without
undue fatigue for either the item writers or the test takers, it will have
a reliability that is definitely greater than the .87 of the present
competing 100-item test. However, it is still true that, as things stand, the
short test has a reliability of .81 and the longer one a higher reliability
of .87.

For some general comparison purposes, we may wish to compare 
test reliabilities, allowing for length of test, but may also feel that it 
is rather arbitrary to reduce all the tests to any specified length, such as 
50, 100, or 200 items. Since all the reliabilities approach unity as the 
test length is increased, we easily see that, if we choose the 200-item test 
as the standard, all the reliabilities will be much closer together than if 
we choose the 50-item test as the standard. Also the statistical sampling 
problems become difficult to work out for various extrapolations of 
different amounts. It is possible to devise a quantity that depends on
test reliability and number of items and is invariant for changes in test
length, as long as the test reliability increases with length according
to the Spearman-Brown formula.


Let us use the following notation:


$R_{LL}$ designates the reliability of a test of length L,
$R_{KK}$ designates the reliability of a test of length K.


The problem then is to find some function, F, such that
$F(L, R_{LL}) = F(K, R_{KK})$. First let us express the reliabilities of each
lengthened test in terms of the reliability of the unit test (r):


(20)
$$R_{LL} = \frac{Lr}{1 + (L - 1)r}, \qquad R_{KK} = \frac{Kr}{1 + (K - 1)r}.$$

Taking the reciprocal and deducting one from each of these equations,
we have

(21)
$$\frac{1}{R_{LL}} - 1 = \frac{1 + (L - 1)r}{Lr} - 1, \qquad \frac{1}{R_{KK}} - 1 = \frac{1 + (K - 1)r}{Kr} - 1.$$

Simplifying the right-hand member of each equation and multiplying
through by the number of items, we have


(22)
$$L\left(\frac{1}{R_{LL}} - 1\right) = K\left(\frac{1}{R_{KK}} - 1\right) = \frac{1}{r} - 1.$$


The number of items multiplied by one less than the reciprocal
of the reliability is invariant with respect to test length.


This, then, is the function mentioned at the beginning of this
paragraph, a function of test length and reliability that is invariant with
respect to test length, provided that test reliability increases with length
according to the Spearman-Brown formula. It furnishes a method of
comparing two tests without making any arbitrary decisions regarding
the length of test at which the comparison is to be made. It may also
be that the sampling theory for such a value can be worked out more
readily than that for the reliabilities that have been increased by use
of the Spearman-Brown formula.
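
The invariant quantity can be illustrated with the two tests discussed in this section, the 20-item test of reliability .81 and the 100-item test of reliability .87. The sketch below is an assumption-laden illustration: the function names are hypothetical, and the comparison is meaningful only insofar as reliability changes with length according to the Spearman-Brown formula.

```python
def spearman_brown(r, k):
    """Equation 10."""
    return k * r / (1.0 + (k - 1.0) * r)

def length_invariant(n_items, reliability):
    """Equation 22: number of items times (1/reliability - 1)."""
    return n_items * (1.0 / reliability - 1.0)

short_as_is = length_invariant(20, 0.81)                           # about 4.69
short_lengthened = length_invariant(100, spearman_brown(0.81, 5))  # again about 4.69
long_as_is = length_invariant(100, 0.87)                           # about 14.94
```

The value for the short test is the same at 20 and at 100 items. Since equation 22 can be solved to give a reliability of n/(n + F) for a test of n items with invariant F, the test with the smaller value (here the 20-item test) would be the more reliable of the two at any common length.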


7. Summary

The general formula for the correlation of any two sums may be
expressed as follows:

Let one set of tests be designated by the subscripts g or h (g = 1 ... K,
h = 1 ... K),
let the other set of tests be designated by the subscripts G or H
(G = I ... L, H = I ... L),
s is a standard deviation of one of the unit tests,
r is the correlation between two of the unit tests,
$R_{KL}$ is the correlation of the sum of K tests with the sum of L other
tests.


$R_{KL}$ can then be written as a function of the r's and s's,


(6)
$$R_{KL} = \frac{\displaystyle\sum_{g=1}^{K} \sum_{G=I}^{L} r_{gG} s_g s_G}{\sqrt{\displaystyle\sum_{g=1}^{K} s_g^2 + \sum_{g=1}^{K} \sum_{\substack{h=1 \\ (g \neq h)}}^{K} r_{gh} s_g s_h}\;\sqrt{\displaystyle\sum_{G=I}^{L} s_G^2 + \sum_{G=I}^{L} \sum_{\substack{H=I \\ (G \neq H)}}^{L} r_{GH} s_G s_H}}.$$


Writing the foregoing expression in terms of average variances and
covariances and denoting the average by a bar over the term, we have


(7)
$$R_{KL} = \frac{KL\,\overline{r_{gG} s_g s_G}}{\sqrt{K\,\overline{s_g^2} + K(K - 1)\,\overline{r_{gh} s_g s_h}}\;\sqrt{L\,\overline{s_G^2} + L(L - 1)\,\overline{r_{GH} s_G s_H}}}.$$

It should be noted that equations 6 and 7 are general and precise. No
assumptions regarding parallel tests or any other limitations on the
nature of the tests were made in deriving them. They are based on
simple algebraic transformations and will be verified, barring
arithmetical errors in the work.

The relationship between test length and reliability has been shown.
Solving explicitly for reliability as a function of test length, we have


(10)
$$R_{KK} = \frac{K r_{11}}{1 + (K - 1) r_{11}} \quad \text{(the Spearman-Brown formula)}.$$

Solving explicitly for test length as a function of reliability, we have


(19)
$$K = \frac{(1 - r_{11}) R_{KK}}{(1 - R_{KK}) r_{11}}.$$


It was shown that


(22)
$$K\left(\frac{1}{R_{KK}} - 1\right)$$


is a function of test length and reliability that can be expected to show
no systematic changes in value as test length increases.


Problems 


1. From the Spearman-Brown formula, write a formula that will show the test 
length necessary for any specified reliability. 


2. There have been several articles dealing with the experimental verification of
the Spearman-Brown formula. Study these articles; then summarize and comment
on them. (See articles by Holzinger and Clayton, 1925; Ruch, Ackerson, and
Jackson, 1926; Furfey, 1926c; Wood, 1926c; Remmers, Shock, and Kelly, 1927; Kelley,
1925; and Gordon, 1924.)


3. State clearly the assumptions made in deriving the Spearman-Brown formula. 
4.

Test    Mean           Standard     Number      Reliability
                       Deviation    of Items

A       (illegible)    14.7         100         .92
B       27.9           10.6         50          .94
C       10.5           4.1          20          .88
D       33.4           10.1         60          .89
E       12.3           3.4          30          .53



(a) What will be the reliability of test A if it is lengthened to 300 items? 

(b) Estimate the reliability of test B if 25 items are added, making a 75-item test. 

(c) How long would test C need to be to have a reliability of .95? 

(d) Suppose that for test A, we were satisfied with a reliability of .85. How many 
items would be required for this lower reliability? 

(e) How many items would be required to give test D an index of reliability of .90?

(f) Estimate the index of reliability of test E for triple length. 

(g) How many items would be required to give test E a reliability coefficient of .90? 


5. Read and comment on the material in Guilford, Psychometric Methods (1936),
page 419, on the Spearman-Brown formula.


6. What is the reliability for a test of infinite length? 


9 


Effect of Test Length 
on Validity (General Case) 


1. Meaning of validity 


Reliability has been regarded as the correlation of a given test with 
a parallel form. Correspondingly, the validity of a test is the correla- 
tion of the test with some criterion. In this sense a test has a great 
many different "validities." For example, the ACE Psychological 
Examination has one validity for predicting grades in English and a 
different validity for predicting grades in Latin. It is also found in 
studying various validity coefficients for a given test that they vary 
from school to school, and from time to time. In other words, validity 
cannot be regarded as a fixed or a unitary characteristic of a test. As 
new uses for a test are contemplated, new validity coefficients must be 
determined; and, when use of a test is continued, the validity coeffi- 
cients must be redetermined at intervals. In the remainder of this 
chapter we shall refer to “test validity" only in the sense that we are 
considering the relationship between test length and its validity for 
predicting a specified criterion. In most practical investigations of a 
test, we should be comparing several different validity coefficients. 


2. Effect of test length on validity 


The general formula for the correlation of any two sums may also be
utilized to determine the effect of test length upon test validity. We
shall first consider the case in which the criterion variable is not altered.
In this case L equals I, since we do not consider the effect of lengthening
the criterion. Let $R_{(1 \cdots K)I}$ be used to designate the correlation between
a test of length K and the original criterion variable. In this case the
general formula (equation 6 of Chapter 8) for the correlation of any two
sums becomes


(1)
$$R_{(1 \cdots K)I} = \frac{\displaystyle\sum_{g=1}^{K} r_{gI} s_g s_I}{\sqrt{\displaystyle\sum_{g=1}^{K} s_g^2 + \sum_{g=1}^{K} \sum_{\substack{h=1 \\ (g \neq h)}}^{K} r_{gh} s_g s_h}\;\sqrt{s_I^2}}.$$

The $s_I$ in numerator and denominator cancel, and the expression
remaining may be written in terms of averages as follows:


(2)
$$R_{(1 \cdots K)I} = \frac{K\,\overline{r_{gI} s_g}}{\sqrt{K\,\overline{s_g^2} + K(K - 1)\,\overline{r_{gh} s_g s_h}}}.$$


Again it should be noted that this expression is precise. It involves
no assumptions whatever. The numerator is an average involving a
number of validity coefficients. The denominator involves the
reliabilities of the unit tests. However, when we have the data for this formula,
we can actually compute the validity of the lengthened test. It is
necessary to estimate the validity of the lengthened test only when the
data necessary for equation 2 are not available. It is reasonable in
such a case to assume that the values found for the first test are a
reasonable approximation to the average values that would be found for
all the unit tests. As indicated before, this assumption must be true
if we succeed in making the new unit tests parallel to the original unit
test. Setting these assumptions down explicitly, we have


(3)
$$r_{1I} s_1 = \overline{r_{gI} s_g}, \qquad s_1^2 = \overline{s_g^2}, \qquad \text{and} \qquad r_{11} s_1^2 = \overline{r_{gh} s_g s_h}.$$


Substituting equations 3 in equation 2, we have


(4)
$$R_{(1 \cdots K)I} = \frac{K r_{1I} s_1}{\sqrt{K s_1^2 + K(K - 1) r_{11} s_1^2}}.$$


Dividing both numerator and denominator by $s_1\sqrt{K}$ and using $R_{KI}$ to
indicate the correlation of the new test (which is K times its original
length) with the original unit criterion, we have


(5)
$$R_{KI} = \frac{r_{1I}\sqrt{K}}{\sqrt{1 + (K - 1) r_{11}}},$$

where $R_{KI}$ is the augmented validity coefficient,
$r_{1I}$ is the validity coefficient of the unit test,
$r_{11}$ is the reliability coefficient of the unit test, and
K is the number of times the test is increased in length.


Multiplying the length of a test by K increases the validity
coefficient as shown in equation 5.

By comparing the preceding equation (equation 5) with equation 10 of
Chapter 8, we see that the multiplier for the validity coefficient is the
square root of that for the reliability coefficient. That is, augmenting
a test so that the reliability will be multiplied by 1.44 will only multiply
the validity by 1.2. Since the validity coefficient is usually
considerably smaller than the test reliability, this usually means that changing
the length of a test can be expected to have only a very slight effect on
the validity of the test.
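
The square-root relation between the two multipliers can be checked with a few lines of Python. This is a sketch under stated assumptions: the function names are hypothetical, and the unit validity of .40, unit reliability of .60, and K = 4 are illustrative figures not taken from the text.

```python
import math

def spearman_brown(r_11, k):
    """Equation 10 of Chapter 8."""
    return k * r_11 / (1.0 + (k - 1.0) * r_11)

def augmented_validity(r_1I, r_11, k):
    """Equation 5: validity of a test k times as long, criterion unchanged."""
    return r_1I * math.sqrt(k) / math.sqrt(1.0 + (k - 1.0) * r_11)

r_1I, r_11, k = 0.40, 0.60, 4   # illustrative values only
reliability_multiplier = spearman_brown(r_11, k) / r_11
validity_multiplier = augmented_validity(r_1I, r_11, k) / r_1I
```

The square of `validity_multiplier` equals `reliability_multiplier`, whatever values are chosen for the unit validity, the unit reliability, and K.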

In order to see readily the effect on validity to be expected from
increasing test length, equation 5 can be simplified by dividing both
sides of the equation by $r_{1I}$ (the validity coefficient of the unit test) and
then dividing both numerator and denominator of the right-hand side
by $\sqrt{K}$. This procedure gives


(6)
$$\frac{R_{KI}}{r_{1I}} = \frac{1}{\sqrt{\dfrac{1}{K} + \left(1 - \dfrac{1}{K}\right) r_{11}}}.$$


Squaring both sides and taking reciprocals, we have


(7)
$$\frac{r_{1I}^2}{R_{KI}^2} = \frac{1}{K} + \left(1 - \frac{1}{K}\right) r_{11}.$$

That is, the ratio of the squared validity coefficients is equal to a linear 
function of the reliability coefficient. It can be easily verified from 
equation 10 of Chapter 8 that the reliability of the unit test divided by 
the reliability of the augmented test equals this same linear function of 
the reliability coefficient. Equation 7 is graphed in Figure 1. The ratio 
of the squared validity coefficients is plotted as the ordinate, and the 
reliability as the abscissa. The appropriate straight line is shown for 
several selected values of K. The graph is read by entering at the bot- 
tom with the known reliability of the unit test and then moving up to 
the selected value of K, and then horizontally out to the right-hand 
margin. For example, as shown by the dotted lines in the figure, if a
test with a reliability of .5 is doubled in length, the ratio of the squared
validity coefficients is .75. That is, the squared validity for the doubled
test will be one-third larger than the squared validity of the unit test.
The validity coefficient will be multiplied by $\sqrt{1.3333}$, or about 1.15;
doubling a test with a reliability of .5 will increase the validity
coefficient by about 15 per cent.
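
The reading taken from the graph in this example can be reproduced from equation 7. The short sketch below is illustrative; the function name is an assumption.

```python
import math

def squared_validity_ratio(r_11, k):
    """Equation 7: (unit validity / augmented validity) squared."""
    return 1.0 / k + (1.0 - 1.0 / k) * r_11

ratio = squared_validity_ratio(0.5, 2)         # 0.5 + 0.5 * 0.5 = 0.75
validity_multiplier = 1.0 / math.sqrt(ratio)   # square root of 4/3, about 1.155
```

The ratio is a straight-line function of the unit reliability for each K, which is what allows the figure to be drawn with only two points for each line.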

By simply changing the scale markings to give the square root of the
reciprocal, we can read directly the ratio of the augmented to the
original validity coefficient. These values are indicated on the scale directly
under the heading $R_{KI}/r_{1I}$. We see immediately from the graph, for
example, that, if the test reliability is greater than .5, making the test
infinitely long increases the validity by less than 41 per cent.

Since the same graph gives directly the ratio of the original to the
augmented reliability coefficient, as can be seen from equation 10 of
Chapter 8, an additional scale has been added at the extreme right
giving the ratio of the augmented to the original reliability coefficient.
This scale is given under the heading $R_{KK}/r_{11}$. By comparing the scale
for the reliability and validity coefficients, we see that an increase in
test length that doubles the reliability coefficient increases the validity
coefficient by only 41 per cent.


Figure 1. Computing diagram for equation 7. The relative increase in validity
or reliability as a function of the original reliability and the amount of increase in
test length.

It is also possible to add another portion to the graph of Figure 1 in
order to include the original and augmented reliability coefficients
instead of merely their ratios. Such a diagram is Figure 2. The easiest
way to read this graph is first to find the validity coefficient of the unit
test ($r_{1I}$) in the scale at the extreme right and then to place a ruler on
the horizontal line for $r_{1I}$. Next enter the bottom left-hand scale with
the value of the reliability coefficient for the unit test, go up to the
radiating line appropriate for K, right to the heavy vertical center line,
down to the ruler (previously placed to indicate $r_{1I}$), and then up to the
value of the augmented validity coefficient. In the illustration shown
(Figure 2), the validity of the unit test is .6, and the reliability of the
unit test is .7. If this test is trebled in length, its new validity will be
between .70 and .65 (about .67).
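
The illustration read from Figure 2 can be confirmed from equation 5. In the hedged sketch below the function name is an assumption, and the limiting value is obtained by letting K grow without bound in equation 5, which gives the unit validity divided by the square root of the unit reliability.

```python
import math

def augmented_validity(r_1I, r_11, k):
    """Equation 5: validity of a test k times as long, criterion unchanged."""
    return r_1I * math.sqrt(k) / math.sqrt(1.0 + (k - 1.0) * r_11)

trebled = augmented_validity(0.6, 0.7, 3)   # about .67
ceiling = 0.6 / math.sqrt(0.7)              # limit for very large K, about .72
```

For this test, then, no amount of lengthening alone could be expected to raise the validity much above .72.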


3. Length necessary for a given validity 
Sometimes in planning new tests it is desirable to know how much a
test must be lengthened in order to achieve a specified validity
coefficient. It might be noted parenthetically here that, before inquiring
how much the test must be lengthened to achieve a given validity, we
might investigate the effect of making the test infinite in length. We
do this simply by using the lowest of the radiating lines (marked K = ∞)
in the left half of Figure 2. This topic is also discussed in a later
section of this chapter, Section 5 (validity for a test of infinite length). If
making a test infinitely long will not achieve the desired validity, we
know that simply increasing test length will never achieve the desired
validity. However, if the test of infinite length has a validity higher
than desired, we see what would happen if the test were only 20, 10, 5,
or 3 times as long. Again Figure 2 can be used, entering it with the
known validity and reliability coefficients for the unit test and the
desired augmented validity. Then the K-value necessary for such an
augmented validity can be determined.
In order to check the approximate result obtained with the graph,
and also for more precise calculation, it is desirable to have an
equation that gives K as a function of $R_{KI}$, $r_{1I}$, and $r_{11}$. Such a formula can
readily be obtained from equation 5. Squaring and multiplying through
by the denominator, we have


(8)
$$R_{KI}^2 + K r_{11} R_{KI}^2 - r_{11} R_{KI}^2 = K r_{1I}^2.$$

Solving equation 8 for K gives


(9)
$$K = \frac{R_{KI}^2 (1 - r_{11})}{r_{1I}^2 - r_{11} R_{KI}^2},$$

where the terms have the same meaning as in equation 5.


Equation 9 gives the test length (K) necessary for a specified
validity ($R_{KI}$).
The graphical representation of equation 7 given in Figures 1 and 2 
also holds for equation 9. Since equation 9 is written explicitly for K, 
it is more convenient to use if we wish to know the length necessary for 
a desired augmented validity coefficient. A zero value for the denom- 
inator of equation 9 indicates that the test must be made infinite in 
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length in order to achieve the desired Rgr. A negative value for the 
denominator indicates that the desired validity cannot be achieved by 
lengthening the test. 
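
The three cases just described (a finite K, an infinite K, and no
attainable K) can be sketched in a few lines of Python. This is only an
illustration of equation 9; the function name and the numerical values
(unit validity .5, unit reliability .7, desired validity .55) are
assumptions chosen for the example, not figures from the text.

```python
import math

def length_for_validity(r_1I, r_11, R_target):
    """Equation 9: factor K by which the test must be lengthened.

    r_1I     -- validity of the unit test (hypothetical input)
    r_11     -- reliability of the unit test (hypothetical input)
    R_target -- desired augmented validity
    Returns K, math.inf if only an infinite test reaches the target,
    or None if no amount of lengthening can reach it.
    """
    numerator = R_target ** 2 * (1.0 - r_11)
    denominator = r_1I ** 2 - R_target ** 2 * r_11
    if math.isclose(denominator, 0.0, abs_tol=1e-12):
        return math.inf          # zero denominator: infinite length needed
    if denominator < 0:
        return None              # negative denominator: target unattainable
    return numerator / denominator

k_needed = length_for_validity(0.5, 0.7, 0.55)
print(k_needed)                                  # a little under 2.4
print(length_for_validity(0.5, 0.7, 0.65))       # None
```

A result between 2 and 3 here would suggest that trebling the test is
more than enough, whereas a result of `None` says that a different test,
not a longer one, is required.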


4. A function of validity that is invariant with respect to
changes in test length

In the preceding chapter, which treated the effect of test length on
reliability, we found one function that did not change with test length;
see equation 22 of Chapter 8. Similarly, if we wish to compare tests of
different lengths with respect to validity for predicting a given
criterion, it is desirable to have some function involving validity that
does not vary with test length. Dividing equation 10 of Chapter 8 by
$r_{11}$ and taking the square root, we have

(10)    $\frac{\sqrt{R_{KK}}}{\sqrt{r_{11}}} = \sqrt{\frac{K}{1 + (K - 1) r_{11}}}.$

Substituting the left side of equation 10 for the radical in equation 5,
we have

(11)    $R_{KI} = \frac{r_{1I} \sqrt{R_{KK}}}{\sqrt{r_{11}}}.$

Similarly, by analogy, if the test had been lengthened L times, we should
have

(12)    $R_{LI} = \frac{r_{1I} \sqrt{R_{LL}}}{\sqrt{r_{11}}},$

from which it follows that

(13)    $\frac{R_{KI}}{\sqrt{R_{KK}}} = \frac{R_{LI}}{\sqrt{R_{LL}}} = \frac{r_{1I}}{\sqrt{r_{11}}}.$

The ratio of the validity coefficient to the index of reliability
does not change with increase in test length.


It should be carefully noted that the foregoing relationship between 
validity and reliability holds only when validity and reliability are 
altered by changing the length of the test, without varying the nature of 
the items in the test. There must be no change in the variability of the 
group taking the test. For a discussion of the changes in reliability and 
validity with changes in the group heterogeneity, see Chapters 10, 11, 
and 12. Chapter 11 includes a discussion of the relationship between 
validity and reliability as the standard deviation of the group changes. 
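
The invariance stated in equation 13 can be checked numerically. The
sketch below assumes the Spearman-Brown form for equation 10 of Chapter 8
and the form of equation 5 used in this chapter; the function names and
the unit values (.5 and .7) are illustrative assumptions.

```python
import math

def spearman_brown(r_11, K):
    """Reliability of a test lengthened K times (equation 10, Chapter 8)."""
    return K * r_11 / (1.0 + (K - 1.0) * r_11)

def lengthened_validity(r_1I, r_11, K):
    """Validity of a test lengthened K times (equation 5)."""
    return r_1I * math.sqrt(K) / math.sqrt(1.0 + (K - 1.0) * r_11)

r_1I, r_11 = 0.5, 0.7   # hypothetical unit validity and reliability

# Validity divided by the index of reliability, for several lengths.
ratios = [
    lengthened_validity(r_1I, r_11, K) / math.sqrt(spearman_brown(r_11, K))
    for K in (1, 2, 3, 10)
]
print(ratios)   # every entry equals r_1I / sqrt(r_11)
```

Because the ratio is the same at every length, two tests of different
lengths can be compared on it directly, provided the group and the
nature of the items are unchanged.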


5. Validity of a perfect test for predicting the original criterion

We can extend equation 5 to estimate what would happen to the 
validity coefficient of a test if it were made infinitely long while the 
criterion measure remained the same. We shall need to assume, of 
course, that in lengthening the test each of the new unit tests is parallel 
to the original one. That is to say, they each have the same mean and 
standard deviation as the original test, and the same reliability and 
validity.

If we let K become infinite in equation 5, it gives the indeterminate
form ∞/∞. However, if we first divide the numerator and denominator
by $\sqrt{K}$, we have

(14)    $R_{KI} = \frac{r_{1I}}{\sqrt{\frac{1}{K} + \left(1 - \frac{1}{K}\right) r_{11}}}.$

If we let K approach infinity, equation 14 simplifies to

(15)    $R_{\infty I} = \frac{r_{1I}}{\sqrt{r_{11}}}.$

If a test is made infinitely long and hence perfectly reliable,
its validity for predicting the original criterion measure will
be the original validity coefficient divided by the index of
reliability for the original test.

That is to say, if it is possible to increase a test in length without
limit, and to do so by adding only parallel forms of the original test, the
augmented validity can never be higher than indicated in equation 15.
This equation is a convenient one to use where we desire to know if it is
worth while to attempt to improve the validity of a test by simply
increasing its length. This equation is much easier to apply than
equation 5, and, if the test has a fair reliability, equation 15 will show
that increasing the length of the test (even to infinite length) will not
appreciably affect its validity. If the reliability of the test is
reasonably low, we find that a reasonable increase in validity can be made
by lengthening the test. Then it is relevant to ask how much longer the
test should be made. Will doubling or trebling the length of the test
probably give a sufficient increase in validity to be worth the effort?
Equation 5 can be used with K taking various values, such as 2, 3, 4, to
see if a practicable increase in test length would change the validity
sufficiently to be worth while. It may be remarked here parenthetically
that the usual conclusion from equation 5 is that the increased validity
obtained from increasing the length of a test is negligible.
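
The comparison suggested above, equation 5 at a few practicable values
of K set against the ceiling of equation 15, might be sketched as
follows. The unit validity (.5) and reliability (.7) are assumed values
used only for illustration, and the function names are not from the text.

```python
import math

def lengthened_validity(r_1I, r_11, K):
    """Equation 5: validity after lengthening the test K times."""
    return r_1I * math.sqrt(K) / math.sqrt(1.0 + (K - 1.0) * r_11)

def validity_ceiling(r_1I, r_11):
    """Equation 15: validity of an infinitely long test."""
    return r_1I / math.sqrt(r_11)

r_1I, r_11 = 0.5, 0.7                      # hypothetical unit test
ceiling = validity_ceiling(r_1I, r_11)
gains = {K: lengthened_validity(r_1I, r_11, K) for K in (2, 3, 4)}

for K, validity in gains.items():
    # Share of the total attainable gain already realized at this length.
    share = (validity - r_1I) / (ceiling - r_1I)
    print(K, round(validity, 3), round(share, 2))
print("ceiling", round(ceiling, 3))
```

Tabulating the share of the attainable gain in this way makes it easy
to see how quickly the returns from further lengthening diminish.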
Equation 15, showing the effect on the validity of making a test
infinitely long, has been graphed in Figure 3. By looking up the
validity at the bottom of the graph, and then moving up to the diagonal
line for the appropriate reliability, we can read from the column at the
right the expected validity of an infinitely long test. We see from this
graph that, if the validity of a test is greater than the square root of
its reliability, the expected validity for infinite length is greater than
unity. This is an unreasonable situation. If any actual figures show a
validity greater than the square root of the reliability, the result may
be regarded as a fluke of some sort that cannot be relied on to repeat
itself. Such a result should lead us to check computations very carefully
to see if any error has been made, even though such results can arise
without errors in arithmetic. Indeed, in general the reliability of a test
should not merely be greater than the square of the validity coefficient
but should also be greater than the validity coefficient. In most
practical situations the validities of a test run much lower than its
reliability. Tests with reliabilities in the nineties have validity
coefficients that might range from .70 to .30 or lower. The dotted lines
in Figure 3 show that if the reliability and validity of the unit test are
.7 and .5, respectively, lengthening the test cannot increase the validity
above .6.

FIGURE 3. Computing diagram for equation 15 or equation 17. Correlation
between two measures when one is increased to infinite length, while the
other remains unaltered.
(15) $R_{\infty I} = r_{1I} / \sqrt{r_{11}}$  or  (17) $R_{1\infty} = r_{1I} / \sqrt{r_{II}}$

As remarked previously, application of the equations showing the
probable effect of lengthening a test upon its reliability and validity will
indicate that not much improvement is to be expected from increasing
the length of the test somewhat. However, since altering length does
not have much effect on reliability and validity, this frequently means
that if we have a very good test, it is possible to shorten it considerably
without seriously damaging its validity or reliability. Equation 5 may
also be used with fractional values for K to determine the effect of
cutting a test to one-third or one-half its present length. It can be seen
that shortening a reliable test by one-third its present length will have
little effect on reliability and validity and may well be considered if the
reliability is already over .95.
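
A fractional K can be tried out directly. The following sketch shortens
a test by one-third (K = 2/3), assuming the Spearman-Brown form of
equation 10 of Chapter 8 for reliability and equation 5 for validity.
The reliability of .95 echoes the case mentioned above; the validity of
.60 is a hypothetical value.

```python
import math

def spearman_brown(r_11, K):
    """Reliability after changing test length by the factor K."""
    return K * r_11 / (1.0 + (K - 1.0) * r_11)

def lengthened_validity(r_1I, r_11, K):
    """Equation 5, which also holds for fractional K (a shortened test)."""
    return r_1I * math.sqrt(K) / math.sqrt(1.0 + (K - 1.0) * r_11)

r_11 = 0.95          # reliability of the present test
r_1I = 0.60          # hypothetical validity of the present test
K = 2.0 / 3.0        # test cut to two-thirds of its present length

short_reliability = spearman_brown(r_11, K)
short_validity = lengthened_validity(r_1I, r_11, K)
print(round(short_reliability, 3), round(short_validity, 3))
```

With these assumed figures the reliability falls by only a few
hundredths and the validity by less than one hundredth, which is the
kind of result the paragraph above has in mind.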


6. Validity of the original unit length test for predicting a
perfect criterion

Equation 14 may be regarded as showing what happens to the
correlation between two measures when one of them is increased to infinite
length and the other remains unaltered. It was indicated that the test
length was increased, while the criterion measure remained unaltered.
The same formal relationships would hold if the test remained the same,
while the criterion measure were increased in length. Thus by symmetry
we can use $R_{1K}$ to represent the augmented validity coefficient and
$r_{II}$ to represent the reliability of the original criterion measure,
and write

(16)    $R_{1K} = \frac{r_{1I}}{\sqrt{\frac{1}{K} + \left(1 - \frac{1}{K}\right) r_{II}}}.$

Equation 16 shows the increase in test validity as the
criterion measure is increased in length.

If K approaches infinity in equation 16, we have in the limit

(17)    $R_{1\infty} = \frac{r_{1I}}{\sqrt{r_{II}}}.$

If the criterion measure is made perfectly reliable by being
made infinitely long, the validity of the original test for
predicting this criterion measure will be the original validity
coefficient divided by the index of reliability for the original
criterion measure.


7. Effect of altering length of both test and criterion 


If we wish to consider the effect of lengthening both the test measure 
and the criterion measure, the general formula for the correlation of 
any two sums applies with very little alteration. We may begin with 
equation 7 of Chapter 8, 


        $\frac{K L \, \bar{r}_{gG} s_g s_G}{\sqrt{K s_g^2 + K(K - 1) \bar{r}_{gh} s_g s_h} \, \sqrt{L s_G^2 + L(L - 1) \bar{r}_{GH} s_G s_H}}$

and make the following assumptions:

1. Since the various forms of the test are parallel, $s_g = s_h$.

2. Since the various criterion units are parallel, $s_G = s_H$.

3. The average validity coefficient $\bar{r}_{gG} = r_{1I}$ (the validity
of the original unit test).

4. The average test reliability $\bar{r}_{gh}$ equals the reliability of
the original unit test ($r_{11}$).

5. The average criterion reliability $\bar{r}_{GH} = r_{II}$ (the
reliability of the original unit criterion measure).

Making these substitutions and simplifying gives

(18)    $R_{KL} = \frac{K L \, r_{1I}}{\sqrt{K + K(K - 1) r_{11}} \, \sqrt{L + L(L - 1) r_{II}}}.$

Dividing the numerator and denominator through by KL gives

(19)    $R_{KL} = \frac{r_{1I}}{\sqrt{\frac{1}{K} + \left(1 - \frac{1}{K}\right) r_{11}} \, \sqrt{\frac{1}{L} + \left(1 - \frac{1}{L}\right) r_{II}}},$

where $R_{KL}$ = the validity of the test augmented K times for predicting
          the criterion augmented L times,

      $r_{1I}$ = the validity of the unit test for predicting the unit
          criterion measure,

      $r_{11}$ = the reliability of the unit test, and

      $r_{II}$ = the reliability of the unit criterion measure.


This formula and many variants of it were given by Spearman (1910). 


Equation 19 is the general equation showing the correlation 
of a test K times as long as the original test with a criterion 
measure L times as long as the original one. Equations 10, 
Chapter 8, and 5, 15, 17, and 21 of Chapter 9 are special 
cases of equation 19. 


If we begin with equation 19 and set L = K, we obtain equation 10
of Chapter 8; set L = 1, we obtain equation 5 of Chapter 9; set L = 1
and K = ∞, we obtain equation 15 of Chapter 9; set K = 1 and L = ∞,
we obtain equation 17 of Chapter 9; and finally, by setting L = K = ∞,
we obtain equation 21 of Chapter 9.
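
These reductions can be confirmed with a short numerical sketch of
equation 19. Infinite length is represented by `math.inf`, for which 1/K
evaluates to zero in floating-point arithmetic. The function name and
the unit values (.5, .7, .6) are assumptions made for the illustration.

```python
import math

def validity_both_lengthened(r_1I, r_11, r_II, K, L):
    """Equation 19: test lengthened K times, criterion lengthened L times."""
    test_term = 1.0 / K + (1.0 - 1.0 / K) * r_11
    criterion_term = 1.0 / L + (1.0 - 1.0 / L) * r_II
    return r_1I / math.sqrt(test_term * criterion_term)

r_1I, r_11, r_II = 0.5, 0.7, 0.6     # hypothetical unit values
INF = math.inf

eq5 = validity_both_lengthened(r_1I, r_11, r_II, 3, 1)       # L = 1
eq15 = validity_both_lengthened(r_1I, r_11, r_II, INF, 1)    # K infinite
eq17 = validity_both_lengthened(r_1I, r_11, r_II, 1, INF)    # L infinite
eq21 = validity_both_lengthened(r_1I, r_11, r_II, INF, INF)  # both infinite

print(eq5, eq15, eq17, eq21)
```

Writing the one general function and obtaining the others as special
arguments mirrors the way the separate equations of this chapter follow
from equation 19.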

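If in equation 19 we divide by $r_{1I}$, square both sides, and take the
reciprocal, we have

(20)    $\left(\frac{r_{1I}}{R_{KL}}\right)^2 = \left[\frac{1}{K} + \left(1 - \frac{1}{K}\right) r_{11}\right] \left[\frac{1}{L} + \left(1 - \frac{1}{L}\right) r_{II}\right].$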


We see that equation 20 is essentially the same as equation 7, except 
that the right side of equation 7 is one linear function, whereas the right 
side of equation 20 is the product of two linear functions. This means 
that the graph of Figure 1 can be complicated somewhat to serve for 
equation 20. In Figure 4 the lower left-hand section gives the value 
for the left-hand bracket of equation 20; the upper right-hand section 
gives the value for the right-hand bracket of equation 20; and the 
lower right-hand section gives the product of these two. For example, 
if the criterion reliability is .6 and L is 2.0, we enter the upper right
section of the graph with these two values, as shown by the dotted lines,
and mark with a ruler the appropriate radiating line in the lower right
section of the graph. For the foregoing values of .6 and 2.0, this line is
the dashed line PO. Leaving the ruler here for a marker, we enter the
lower left graph with the values of K and the test reliability. The
dotted lines illustrate this procedure for the case in which the test
reliability is .7, and K is 3.0. Thus we see that, if the criterion
reliability is .6 and the test reliability is .7, then if the criterion
measure were doubled and the test were tripled, the new validity
coefficient would be about 25 per cent above the old one.
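
The graphical reading can be checked against equation 20 itself. The
sketch below evaluates the two brackets for the values used in the
example (test reliability .7 with K = 3, criterion reliability .6 with
L = 2); the variable names are illustrative only.

```python
import math

def bracket(reliability, factor):
    """One bracket of equation 20: 1/K + (1 - 1/K) r."""
    return 1.0 / factor + (1.0 - 1.0 / factor) * reliability

left = bracket(0.7, 3.0)     # test: reliability .7, tripled
right = bracket(0.6, 2.0)    # criterion: reliability .6, doubled

# Equation 20 gives (r_1I / R_KL)^2; invert and take the root to get
# the ratio of the new validity to the old one.
ratio_new_to_old = 1.0 / math.sqrt(left * right)
print(left, right, ratio_new_to_old)
```

Both brackets come out at .8, so their product is .64 and the new
validity is 1.25 times the old one, a value the diagram can only show
approximately.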
Again, as was emphasized in Chapter 6, we must note that, in order
to increase the "test length" effectively, both the number of items and
the test time limits must usually be altered. Also there must be no
[...] tions of Chapters 6, 7, 8, and [...] have the same mean, error of
measurement [...] and that all the intercorrelati[...] these criteria are
met, [...] parallel tests, and the equations for the effect of test length
apply. A statistical criterion for parallel tests is given in Chapter 14.

FIGURE 4. Computing diagram for equation 19 or equation 20. The relative
increase in validity due to lengthening both measures by any specified
amount.

The problem of how to adjust the relative lengths of several tests to 
maximize the validity of the composite has been solved; see Horst
(1948) and (1949), and Taylor (1950).


8. Validity of a perfect test for predicting a perfect criterion 
(the correction for attenuation) 

If both K and L are allowed to approach infinity, several of the terms 
will vanish, giving the formula for the correlation of a test of infinite 
length (or unit reliability) with a criterion of infinite length (and hence 
of unit reliability). This equation is 

(21)    $R_{\infty\infty} = \frac{r_{1I}}{\sqrt{r_{11} r_{II}}}.$

This is the well-known correction for attenuation. It is not properly
called a "correction." Rather it is an estimate of the correlation
between a perfect test and a perfect criterion.

This formula was given by Spearman (1904a), (1907), (1910), and
(1913).

A correlation coefficient "corrected for attenuation" ($R_{\infty\infty}$)
may be regarded as (a) the correlation between true scores in
each of the two measures and (b) the correlation between the
two measures when each is increased to infinite length (and
hence given a reliability of 1.00). This correlation is equal
to the correlation between the original measures divided by the
geometric mean of the two reliability coefficients.


Initial interest in the correction for attenuation arose from the belief
that it gave the "true" correlation between two variables, unattenuated
because of the use of fallible (unreliable) measuring instruments. It
was thought that one of the sources of variation in observed coefficients
of correlation was variation in reliability of tests used. Therefore, if
coefficients were corrected for attenuation, there would be greater
agreement between different experiments. With the development of
factor theory, it became clear that, although variation in reliability
was one source of disagreement between the results of different
experimenters, there were many other important sources. In particular, each
test has a fairly high specific factor which is not duplicated in other
similar tests, and therefore would be a source of variation in the results
of different investigators. Also the factor composition of many, if not
most, tests is complex, and the variation in factor structure from one
test to another is another possible source of variation in results of
different investigations. Although the notion that the correction for
attenuation would give invariant results despite fallible tests does not
seem to have been borne out, the equation is still valuable in giving a
quick indication of the worth-whileness of attempting to increase
validity by increasing test length.

It will be noted that equations 21 and 15 are analogous. In equa- 
tion 15 the validity is divided by the square root of one reliability co- 
efficient. If the result is divided by the square root of a second reliabil- 
ity coefficient, we have equation 21. Thus two graphs like that of 
Figure 3 will give a computing diagram for equation 21. These two 
graphs are shown in Figure 5. Enter the graph with the correlation of 
the two tests on the scale at the lower left, move up to the diagonal 
representing one of the reliability coefficients, then to the right to the 
diagonal representing the other reliability coefficient, and then up to 
read the result from the scale at the upper right. The dotted lines 
represent this procedure for the case in which the correlation between 
the two tests is .5, while the reliability coefficients are .8 and .9. In
this case the correlation between true scores is about .59.
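
The graphical example can be reproduced arithmetically from equation 21.
The function below is a minimal sketch; its name is an assumption, and
the inputs are the values used in the example above.

```python
import math

def corrected_for_attenuation(r_xy, r_xx, r_yy):
    """Equation 21: observed correlation divided by the geometric
    mean of the two reliability coefficients."""
    return r_xy / math.sqrt(r_xx * r_yy)

true_score_r = corrected_for_attenuation(0.5, 0.8, 0.9)
print(round(true_score_r, 2))   # about .59, as read from Figure 5
```

Dividing first by the square root of one reliability and then by the
square root of the other, as the two-stage diagram does, gives the same
result as dividing once by the square root of their product.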

We should note carefully just what we are doing when using this 
equation. It is an estimate of the correlation between test and criterion 
if both could be made perfectly reliable by lengthening the test and the 
criterion measure indefinitely. Just because we might get a validity of 
.90, for example, by lengthening the test and criterion does not mean
that we have such a validity coefficient with the original test. However,
if the coefficient of validity "corrected for attenuation" is near
unity, it does show that the major problem to work on for better
prediction is the most appropriate means of increasing reliability of test
and criterion measures. If (when "corrected for attenuation") the
validity coefficient is still in the neighborhood of some reasonably low
value, such as .6 or .7, we can conclude that further work in that
particular field should take two directions. First, it is desirable to try
to improve the reliability of both criterion and test measures. The
coefficient corrected for attenuation shows the maximum validity we can
reasonably hope for by such efforts. If this validity is still a
considerable distance from unity, we can also look for new tests to add to
the prediction battery. If used in this way, to determine what work should
be done next (whether to search for new tests and also to improve the
reliability of tests already in use, or only to improve reliability of
tests now in use), the correction for attenuation is a valuable tool in
directing future research. However, the use of this correction for the sole
purpose of being able to report a higher validity coefficient, accompanied
by the implication that this higher coefficient has already been
achieved, is distinctly misleading and erroneous. When Spearman
(1904a) first presented the correction for attenuation, it was vigorously
criticized by Pearson (1904) and others. An excellent discussion of the
correct and incorrect use of this formula has been presented by Thouless
(1939).


9. Summary 


Several equations have been developed showing the relationship
between test length and validity. The most general equation enables
us to estimate the correlation between two tests, when one is increased
to K and the other to L times its original length. Using $R_{KL}$ to
designate this correlation, we have

(19)    $R_{KL} = \frac{r_{1I}}{\sqrt{\frac{1}{K} + \left(1 - \frac{1}{K}\right) r_{11}} \, \sqrt{\frac{1}{L} + \left(1 - \frac{1}{L}\right) r_{II}}},$

where $r_{11}$ is the reliability of the original test,

      $r_{II}$ is the reliability of the original criterion measure, and

      $r_{1I}$ is the correlation between the two measures (the validity of
      the original test).

All the other equations derived in Chapter 9 are special cases of this
one.

If one of the tests is increased in length, while the other remains
unaltered, we have the special case of equation 19 in which L = 1. If we
use $R_{KI}$ to designate the new validity, obtained by lengthening only
one of the measures, we have

(5)    $R_{KI} = \frac{r_{1I} \sqrt{K}}{\sqrt{1 + (K - 1) r_{11}}},$

where $r_{11}$ is the reliability of the measure that is to be increased
in length. The reliability of the test that remains unaltered
does not enter into the equation.

Equation 5 may also be written to show explicitly the amount of
increase in length necessary for any desired new validity. This equation
is

(9)    $K = \frac{R_{KI}^2 (1 - r_{11})}{r_{1I}^2 - R_{KI}^2 r_{11}}.$

If the test denoted by subscript 1 becomes infinite in length, whereas
the test denoted by I is unaltered, we have the special case of
equation 19, in which K is ∞ and L is 1. The resulting correlation
($R_{\infty I}$) is given by

(15)    $R_{\infty I} = \frac{r_{1I}}{\sqrt{r_{11}}}.$

If the test designated by I is made infinite in length, whereas the test
designated by 1 is unchanged, we have the special case in which K is 1
and L is ∞. The corresponding equation is

(17)    $R_{1\infty} = \frac{r_{1I}}{\sqrt{r_{II}}}.$

It was also shown in developing equation 13 that the foregoing
expressions, equations 15 and 17, do not change as the length of the test
changes. Thus, for comparing the relative performance of tests that
vary in length, the validity coefficient divided by the index of reliability
may be used.

The correlation between true scores or between measures, each of
which has been made infinitely long (and hence perfectly reliable), is
the special case of equation 19, in which both K and L become infinite.
This correlation ($R_{\infty\infty}$) is given by

(21)    $R_{\infty\infty} = \frac{r_{1I}}{\sqrt{r_{11} r_{II}}}$    (the correction for attenuation).


Problems

1. Under what conditions can the validity of a test be equal to its reliability?

2. Prove that the validity of a test can never be greater than its reliability. What
assumption was used?

3. From the equations showing the relationship of test length to validity and to
reliability, determine the relationship between test reliability and validity as the
length of the test is increased while the criterion is unchanged, that is, write

        $f(r_{1I}, r_{11}, R_{KK}, R_{KI}) = 0.$

4. Fifty mathematics problems in free-answer form are rewritten as multiple-
choice items. If the reliability of the free-answer form is .88 and the reliability of
the multiple-choice form is .93, what correlation would be expected between the
two forms?

5. If the reliability of a test is raised from .80 to .90 by lengthening the test, a
validity coefficient of .60 for this test would be expected to increase to what value?

6.
     |      | Standard  | Number   |             | Validity Criterion
Test | Mean | Deviation | of Items | Reliability | (School Grade Average)
A    | 16.5 |  4.4      |  30      | .72         | .68
B    | 12.6 |  3.5      |  20      | Erud        | .50
C    | 53.2 | 10.7      | 100      | .88         | .68
D    | 32.3 |  8.8      |  50      | .91         | ATL
E    | 66.3 | 17.2      | 120      | .95         | 5

(Criterion reliability = .70)


(a) If test A is lengthened to a 100-item test, what would you expect the new 
mean, standard deviation, reliability, and validity to be? (Assume that the 
criterion test has not been altered.) 

(b) If test B is lengthened to increase its reliability to .90, how many new items 
will be needed? What will the new validity be, assuming that the criterion 
test remains unchanged? 


(c) Which of the five tests is, in its present form, best for use in predicting school 
grade average? 

(d) Which of the five tests seems to be intrinsically closest to this criterion? 

(e) If there is time and material for a 200-item test in each case, which of the 
tests would probably perform best at the new length? 

(f) If the reliability of the criterion is .70, what is the correlation between true 
criterion scores and true test scores for each of the tests? 

(g) If it were possible to improve the criterion by methods analogous to increasing 
test length, so that the criterion reliability were raised from .70 to .90, what 
would be the new validity of test C in its present form? 


(h) To raise the criterion reliability from .70 to .90 corresponds to an increase in 
criterion test length of about how many times? 


(i) Give the true variance, and error variance for test C. Estimate the true and 
error variances for test C if it is increased to 300 items. 

(j) If test D is increased to a 150-item test and the criterion test is doubled in 
length, estimate the new reliabilities and validity. 


(k) What will be the validity of test E if its reliability is increased to unity? 


7. Test X has a validity coefficient of .65 and reliability .75, whereas the validity 
of test Y is .67 and its reliability .95. Each of these tests is a 50-item test. Which


type of item (that in test X or in test Y) would probably show the greater validity 
for a 200-item test? 

8. Prove that, if a test of n items is a subtest of a test with m items (n < m), the
correlation $r_{nm}$ is

        $r_{nm} =$ [...]

where r is the reliability of a unit test.

9. Derive the equation for the correlation between true scores in two different
tests. Use only the assumption that "true score plus error score equals gross score,"
that "all correlations with an error score are zero," and some appropriate definition
of parallel tests.


10 


Effect of Group Heterogeneity 
on Test Reliability 


1. Introduction 


The correlation between two variables is markedly affected by the 
range of the variables. For example, if we correlated height and weight 
for a group of persons who ranged from 5 feet, 6 inches, to 5 feet, 8 inches 
in height, we should find that the correlation is very low, as illustrated 
in Figure 1. It is, of course, unlikely that we should make such a pecul- 
iar selection of persons for the purpose of correlating height and weight. 
However, the effect would be similar if, for example, we were to correlate 
height and weight for pupils in the fifth grade, as compared with cor- 
relating height and weight for pupils in grades one to twelve. The 
correlation between mental age and chronological age will be much 
greater for a school population than for a given grade. 

In a similar manner, restriction of range lowers a reliability coeffi- 
cient. If a mental test is given to a random sampling of children aged 
six to sixteen, the range of scores will be very great, and the reliability
coefficient will be high. If, on the other hand, the test is given to a 
group of eighth-grade students who have a rather narrow mental ability 
range, the reliability coefficient will be much lower. 

By making certain reasonable assumptions, it is possible to estimate 
the amount of change in reliability that will result from any given change 
in the group variance. Also by solving the equations for variance it is
possible to estimate the amount of change in variability it would have 
taken to produce any given change in reliability. 

First let us recall that the observed variance of a test has two
components. It is the sum of the true variance and the error variance. It
is possible to increase the observed variance by increasing either the
true variance or the error variance. In all the illustrations given above, 
for example, it was the true variance that changed. That is to say, the 
actual mental ability range of a group of six- to sixteen-year-olds is 
greater than the mental ability range of a group of twelve-year-olds. 

In general, when we give a test to two different groups and find that the
standard deviation of one group is larger than that of the other group, 
we are dealing with a case where the true variance of one group is 
greater than that of the other group. The only other event that could 
account for the difference in group standard deviations would be an 
alteration in the error variance. This would mean that the test was
given under good conditions in one group and under poor conditions in
the other group. It is to be noted that ordinary errors in test procedure,
such as allowing too little time (erroneously calling time 10 minutes too
soon, for example), would affect everyone's score in the same direction,
and would not bring about an increase in the error variance of the test.
An increase in error variance means that the scores of some persons
were raised considerably above where the score should have been,
while for other persons the score was considerably lowered. This sort
of effect might be produced, for example, if the room was large and the
students in the first rows received good directions and some help in
answering the questions, whereas the students in the last rows did not
understand the directions and did not receive any special help so that
their scores were lower than they would have been under the standard

FIGURE 1. Illustration of effect of changes in group heterogeneity on correlation.

devised and used. A solution to this problem has been given by Green
(1950b).

Otis and Knollin (1921) pointed out that the error of measurement 
was superior to reliability as a test statistic. Kelley (1921) also recog- 
nized some of the disadvantages of the reliability coefficient and the 
advantages of the error of measurement. He discussed, and suggested, 
some solutions for the problem of establishing a suitable unit for the 
error of measurement. Basic statistics on any test should include the 
error of measurement as well as the reliability. 

It should be noted that learning can also have the effect of increasing
or decreasing the test score variance for a given group. In general, if
we began testing the group when they knew very little, all scores would
be low, the mean would be low, and the variance would be low. We
should say that the test was too difficult for this group. As the members
of the group learned more, the average score and the score variance
would increase for a time. Then as learning continued beyond this
stage, we should eventually find that the test was too easy for the group.
All persons would make perfect or near perfect scores; hence the mean
score would be high, and the standard deviation of scores would be
small again.

Copeland (1934) has pointed out that teaching a class so that the
students begin to approach a perfect score will lower not only the test
variance but also the test reliability. It is also clear that, if the test
were initially too difficult for the group, so that the scores were
uniformly near zero, we should expect the first effect of learning to be an
increase in the test variance, and hence an increase in test reliability.
It must be emphasized, however, that the effect of such changes in test
variance is not related to the discussion presented in this chapter.
There is no reason for believing that this effect is the equivalent of
selecting the members of a group in such a way that the true variance
will be altered and the error variance unaffected. As we approach the
floor or the ceiling of a test, the error variance is clearly affected, but
the theory presented in this chapter has nothing to do with such effects.
The theory presented here is based on the assumption that we are working
entirely within the appropriate range for the test and that no "floor"
or "ceiling" effects are present.


Figure 2 is set up in terms of

(6)    $\frac{S_X}{s_x} = \sqrt{\frac{1 - r_{xx}}{1 - R_{XX}}}.$

From it we can read the proportional change in standard deviation
corresponding to a given change in reliability, or the change in reliability
that corresponds to a given ratio of standard deviations.

For example, if $r_{xx}$ is .64 and $R_{XX}$ is .91, we locate .91 across
the top of the graph, .64 along the left side, and note that these two
lines intersect on the diagonal line labeled 2.0. This means that a change
in reliability from .64 to .91 would occur if the observed standard
deviation were doubled and the entire increase were due to a change in true
variance. In a similar manner, we observe that an increase of 25 per cent
in the standard deviation, if due entirely to a change in true variance,
will be expected to raise the reliability coefficient from .75 to .84.

FIGURE 2. Computing diagram for equations 5 and 6:
$S_X^2 / s_x^2 = (1 - r_{xx}) / (1 - R_{XX})$
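
Both readings from the diagram can be verified by direct computation.
The sketch below uses equation 6 for the ratio of standard deviations
and the same relation rearranged to give the new reliability; the
rearranged form and the function names are assumptions made here for
illustration.

```python
import math

def sd_ratio(r_xx, R_XX):
    """Equation 6: ratio S_X / s_x implied by two reliabilities,
    assuming only the true variance differs between the groups."""
    return math.sqrt((1.0 - r_xx) / (1.0 - R_XX))

def reliability_after_sd_change(r_xx, ratio):
    """Equation 6 rearranged: reliability in the group whose standard
    deviation is `ratio` times that of the original group."""
    return 1.0 - (1.0 - r_xx) / ratio ** 2

ratio_example = sd_ratio(0.64, 0.91)                        # doubled spread
raised_example = reliability_after_sd_change(0.75, 1.25)    # 25 per cent more
print(ratio_example, raised_example)
```

The first call returns the ratio 2.0 and the second the reliability .84,
in agreement with the two examples read from the graph.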
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i in reliability with a change in group 
Graphs for computing the change in reliability g 
esca have also been presented by Rulon (1930), Cureton and Dun- 
lap (1929), and Toops and Edgerton (1927). 


3. Effect of changes in error variance on reliability 


Occasionally students are confused by the fact that in the usual
formulas for correlation, the standard deviation appears in the denominator.
They point out that, if the denominator of a fraction is
increased, the fraction is decreased; therefore, the argument runs, "an
increase in standard deviation should be expected to lower the reliability."
If we take the equation for the true variance (see equation 20 of
Chapter 3),

(7) $s_t^2 = s_x^2 r_{xx}$,

we may divide through by $s_x^2$ and write the reliability coefficient as

(8) $r_{xx} = \dfrac{s_t^2}{s_x^2} = \dfrac{s_t^2}{s_t^2 + s_e^2}$.

Thus we see that the only way for the denominator to increase while
the numerator remains constant is for the true variance to remain constant.
This necessarily means that the entire change must be due to an
increase in the error variance. It is true that, if the observed variance
of a test changes, owing solely to an increase in the error variance, the
reliability of the test will decrease. It is possible to derive the equation
showing the relationship between observed standard deviation and reliability
on the assumption that the true variance is constant. No one,
so far as I know, has ever reported a case where such an equation could
reasonably be used. However, we may derive the equation here simply
to emphasize the fact that an increase in observed standard deviation
will have the effect of increasing reliability, if it is due to an increase in
true variability, and will have the effect of decreasing the reliability if it
is due to an increase in error variability.

If the true variance of the test for group x is equal to that for group X,
we may write the two equations for true variance (see equations 39 of
Chapter 2 or 20 of Chapter 3)

(9) $s_t = s_x \sqrt{r_{xx}}$, and

(10) $S_T = S_X \sqrt{R_{XX}}$.

Since the true variances are in this case assumed to be equal, we may
write

(11) $s_x \sqrt{r_{xx}} = S_X \sqrt{R_{XX}}$.

Squaring both sides and dividing through by $S_X^2$ gives

(12) $R_{XX} = r_{xx} \dfrac{s_x^2}{S_X^2}$,

where the terms have the same definition as in equation 4.


If $s_x^2$ is the variance of group x on a given test, and $S_X^2$ is
the variance of group X on the same test, and if this difference
is attributed solely to a difference in error variance
for the two groups, the reliability for group X ($R_{XX}$) is given
by equation 12.


From equation 12, we easily see that, as the variance ($S_X^2$) increases,
the reliability ($R_{XX}$) decreases. The change is very drastic. For example,
if the standard deviation ($S_X$) is double the standard deviation
($s_x$), the reliability ($R_{XX}$) is one-fourth the reliability ($r_{xx}$).
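The contrast between the two cases can be made concrete with a few lines of arithmetic. The sketch below, in Python, applies equation 12 under its stated (and, as the text stresses, unrealistic) assumption that the true variance is constant and only the error variance changes; the function name is illustrative.

```python
def reliability_after_error_variance_change(r_xx, sd_ratio):
    """Equation 12: R_XX = r_xx * (s_x / S_X)**2.

    sd_ratio is S_X / s_x. The true variance is assumed to be the same
    in both groups, so any change in observed variance is error variance.
    """
    if sd_ratio <= 0:
        raise ValueError("sd_ratio must be positive")
    return r_xx / sd_ratio ** 2


# Doubling the standard deviation through added error alone
# divides the reliability by four.
print(reliability_after_error_variance_change(0.80, 2.0))
```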


It should again be pointed out that the assumption that a
change in observed variance is due entirely to a change in
error variance is very unreasonable. It will not occur except
with very peculiar and very careless testing methods.


4. Conditions under which the error of measurement is invariant with respect to test score

The derivations presented in this chapter have depended in general 
on the assumption that the error of measurement is constant for a given 
test regardless of the variability of the group to which the test is given. 
This assumption will be true in general only if the error of measurement 
is the same regardless of the magnitude of the test score. Does the 
error of measurement change in some systematic fashion as the magni- 
tude of the test score changes? To study this problem analytically, 
we may proceed by expressing the error of measurement as a function 
of test score. The solution presented in this section is the one given by
Mollenkopf (1948), (1949).

Let us regard the score for individual i as made up of two equivalent
parts, for example,

(13) $x_i = x_{i1} + x_{i2}$.

The subscripts 1 and 2 designate equivalent halves which are such that

(14) $s_1 = s_2$,

where

$s_g^2 = \dfrac{\sum_{i=1}^{N} x_{ig}^2}{N}$ $(g = 1, 2)$;

(15) $\alpha' = \alpha''$,

and

(16) $\beta' = \beta''$,

where

(17) $\alpha' = \dfrac{\sum_{i=1}^{N} x_{i1}^3}{N s_1^3}$, $\alpha'' = \dfrac{\sum_{i=1}^{N} x_{i2}^3}{N s_2^3}$,

and

(18) $\beta' = \dfrac{\sum_{i=1}^{N} x_{i1}^4}{N s_1^4}$, $\beta'' = \dfrac{\sum_{i=1}^{N} x_{i2}^4}{N s_2^4}$.

Using the relationship shown in equation 10, Chapter 6, we may write
the variance of the total test ($S_x^2$) as follows:

(19) $S_x^2 = 2 s_1^2 (1 + r)$,

where

$r = \dfrac{\sum_{i=1}^{N} x_{i1} x_{i2}}{N s_1 s_2}$.

From equation 30, Chapter 6, the reliability of the total test (R) may
be written as

(20) $R = \dfrac{2r}{1 + r}$.

From equation 37, Chapter 3, the error of measurement ($S_e$) may be
written as

(21) $S_e^2 = S_x^2 (1 - R)$.

To express $S_e$ in terms of $s_1$ and r, we substitute equations 19 and 20 in
equation 21 and simplify, obtaining

(22) $S_e^2 = 2 s_1^2 (1 - r)$.

Using equation 14, we may write the variance of the differences
$(x_{i1} - x_{i2})$ as

(23) $\dfrac{\sum_{i=1}^{N} (x_{i1} - x_{i2})^2}{N} = 2 s_1^2 (1 - r)$.

Thus we see that

(24) $S_e^2 = \dfrac{\sum_{i=1}^{N} (x_{i1} - x_{i2})^2}{N}$.

The square of the error of measurement for the total test is equal to
the variance of the differences of parallel halves. Thus we
may regard the squared difference $(x_{i1} - x_{i2})^2$ as the "error" for individual i,
and the average of these errors as the squared error of measurement for
the test. Let us define this error for individual i by

(25) $y_i = (x_{i1} - x_{i2})^2$.
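Equations 19 to 24 can be verified numerically on any pair of half scores whose standard deviations are equal. The following Python sketch is an illustration only: the function name and the small data set are assumptions introduced here, and the identity is exact only when equation 14 ($s_1 = s_2$) holds.

```python
from math import sqrt


def split_half_summary(half1, half2):
    """Compare S_x^2 (1 - R) with the mean squared half-score difference."""
    n = len(half1)
    mean1 = sum(half1) / n
    mean2 = sum(half2) / n
    # Deviation scores for each half, as in equation 13.
    x1 = [v - mean1 for v in half1]
    x2 = [v - mean2 for v in half2]
    s1 = sqrt(sum(v * v for v in x1) / n)
    s2 = sqrt(sum(v * v for v in x2) / n)
    r = sum(a * b for a, b in zip(x1, x2)) / (n * s1 * s2)
    R = 2 * r / (1 + r)                                        # equation 20
    total_var = sum((a + b) ** 2 for a, b in zip(x1, x2)) / n  # S_x^2
    diff_var = sum((a - b) ** 2 for a, b in zip(x1, x2)) / n   # equation 24
    return {
        "r": r,
        "R": R,
        "total_var": total_var,
        "error_var_from_R": total_var * (1 - R),               # equation 21
        "error_var_from_diffs": diff_var,
    }


# Hypothetical half scores with identical means and standard deviations.
summary = split_half_summary([1, 2, 3, 4], [2, 1, 4, 3])
print(summary)
```

For these hypothetical scores the two estimates of the squared error of measurement agree, as equations 21 and 24 require.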


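Let us now approximate $y_i$ as closely as possible by a second-degree
function of $x_i$, designated

(26) $\hat{y}_i = a x_i^2 + b x_i + c$,

where a, b, and c are chosen so that

(27) $\sum_{i=1}^{N} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{N} (y_i - a x_i^2 - b x_i - c)^2$

is a minimum.

To minimize this expression, differentiate successively with respect to
a, b, and c, and set the derivatives equal to zero. This procedure gives
the equations

(28)
$a \sum x_i^4 + b \sum x_i^3 + c \sum x_i^2 = \sum x_i^2 y_i$,
$a \sum x_i^3 + b \sum x_i^2 + c \sum x_i = \sum x_i y_i$,
$a \sum x_i^2 + b \sum x_i + cN = \sum y_i$,

where each summation runs from i = 1 to N.

The solution for a, b, and c in determinantal form is

(29)
$a = \dfrac{1}{D}\begin{vmatrix} \sum x^2 y & \sum x^3 & \sum x^2 \\ \sum xy & \sum x^2 & \sum x \\ \sum y & \sum x & N \end{vmatrix}$,
$b = \dfrac{1}{D}\begin{vmatrix} \sum x^4 & \sum x^2 y & \sum x^2 \\ \sum x^3 & \sum xy & \sum x \\ \sum x^2 & \sum y & N \end{vmatrix}$,
$c = \dfrac{1}{D}\begin{vmatrix} \sum x^4 & \sum x^3 & \sum x^2 y \\ \sum x^3 & \sum x^2 & \sum xy \\ \sum x^2 & \sum x & \sum y \end{vmatrix}$,

where

$D = \begin{vmatrix} \sum x^4 & \sum x^3 & \sum x^2 \\ \sum x^3 & \sum x^2 & \sum x \\ \sum x^2 & \sum x & N \end{vmatrix}$.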
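The normal equations 28 and their determinantal solution 29 translate directly into a short computation. The sketch below, in Python, is a minimal illustration of that least-squares step using Cramer's rule on the 3 x 3 system; the function names are assumptions made here, and for large data sets a numerical library would ordinarily be preferred.

```python
def det3(m):
    """Determinant of a 3 x 3 matrix given as three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def fit_quadratic(xs, ys):
    """Least-squares a, b, c for y-hat = a x^2 + b x + c (equations 28, 29)."""
    n = len(xs)
    sx = sum(xs)
    sx2 = sum(x ** 2 for x in xs)
    sx3 = sum(x ** 3 for x in xs)
    sx4 = sum(x ** 4 for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sx2y = sum(x * x * y for x, y in zip(xs, ys))

    lhs = [[sx4, sx3, sx2],
           [sx3, sx2, sx],
           [sx2, sx, n]]
    rhs = [sx2y, sxy, sy]

    d = det3(lhs)
    if d == 0:
        raise ValueError("normal equations are singular; need 3 distinct x values")

    def with_column_replaced(col):
        # Cramer's rule: substitute the right-hand side for one column.
        return [[rhs[r] if k == col else lhs[r][k] for k in range(3)]
                for r in range(3)]

    return tuple(det3(with_column_replaced(col)) / d for col in range(3))


# Points lying exactly on y = 0.5 x^2 - x + 3 should be recovered exactly.
xs = [-2, -1, 0, 1, 2]
ys = [0.5 * x * x - x + 3 for x in xs]
print(fit_quadratic(xs, ys))
```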
The problem now is to express a, b, and c in terms of the moments of
the x-score distribution. The terms $\sum x$, $\sum x^2$, $\sum x^3$, $\sum x^4$, and $\sum y$ can be
expressed readily as follows:

(30) $\sum x = 0$,

(31) $\sum x^2 = N S_x^2$,

(32) $\sum x^3 = N S_x^3 \alpha_x$,

(33) $\sum x^4 = N S_x^4 \beta_x$,

and

(34) $\sum y = N S_x^2 (1 - R)$,

where

(35) $\alpha_x = \dfrac{\sum x^3}{N S_x^3}$, and

(36) $\beta_x = \dfrac{\sum x^4}{N S_x^4}$.

This completes the solution, except for the expressions $\sum xy$ and $\sum x^2 y$.
We must now find expressions for these terms as functions of the moments
of the x-score distribution.

In order to do this, it is necessary first to find the value of the third-
and fourth-degree product moments, $\sum x_1^2 x_2$ (or $\sum x_1 x_2^2$), $\sum x_1^3 x_2$ (or
$\sum x_1 x_2^3$), and $\sum x_1^2 x_2^2$.

To simplify $\sum x_1^2 x_2$, we proceed as follows. We consider the regression
of $x_2$ on $x_1$, and write $x_2$ as the sum of the score predicted from $x_1$
and the residual error designated $e_2$. This gives

(37) $x_2 = r \dfrac{s_2}{s_1} x_1 + e_2$.

Substituting this value for $x_2$ in $\sum x_1^2 x_2$, and noting that $s_1 = s_2$, we have

(38) $\sum x_1^2 x_2 = r \sum x_1^3 + \sum e_2 x_1^2$.

Let us assume that

(39) $r_{e_2 x_1^2} = 0$.
From the gross score formula for correlation,

$r_{xy} = \dfrac{\sum XY - N\bar{X}\bar{Y}}{\sqrt{(\sum X^2 - N\bar{X}^2)(\sum Y^2 - N\bar{Y}^2)}}$,

we note that if $r_{xy} = 0$, then $\sum XY = N\bar{X}\bar{Y}$. Thus, if $r_{e_2 x_1^2} = 0$,

$\sum e_2 x_1^2 = N \left[\dfrac{\sum e_2}{N}\right] \left[\dfrac{\sum x_1^2}{N}\right]$.

Since $\sum e_2$ is zero, it follows that

(40) $\sum e_2 x_1^2 = 0$.

Substituting equations 17 and 40 in equation 38, we have

(41) $\sum x_1^2 x_2 = N r s_1^3 \alpha'$.

To evaluate $\sum x_1 x_2^2$, we assume that

(42) $r_{e_1 x_2^2} = 0$,

and by a corresponding procedure find that

(43) $\sum x_1 x_2^2 = N r s_1^3 \alpha'$.

To simplify $\sum x_1^3 x_2$, we proceed by substituting for $x_2$ from equation
37, noting that $s_1 = s_2$, and writing

(44) $\sum x_1^3 x_2 = r \sum x_1^4 + \sum e_2 x_1^3$.

As before, if we assume that

(45) $r_{e_2 x_1^3} = 0$,

it follows that

$\sum e_2 x_1^3 = N \left[\dfrac{\sum e_2}{N}\right] \left[\dfrac{\sum x_1^3}{N}\right]$.

Since $\sum e_2$ is zero, it follows that

(46) $\sum e_2 x_1^3 = 0$;

hence, substituting equations 18 and 46 in equation 44, we have

(47) $\sum x_1^3 x_2 = N r s_1^4 \beta'$.

In like manner, by assuming that

(48) $r_{e_1 x_2^3} = 0$,

we may show that

(49) $\sum x_1 x_2^3 = N r s_1^4 \beta'$.
To simplify $\sum x_1^2 x_2^2$, we again use equation 37, note that $s_2/s_1 = 1$,
and write

(50) $\sum x_1^2 x_2^2 = \sum x_1^2 (r x_1 + e_2)^2$.

Expanding and taking the constants outside the summation, we have

(51) $\sum x_1^2 x_2^2 = r^2 \sum x_1^4 + 2r \sum x_1^3 e_2 + \sum x_1^2 e_2^2$.

If we assume that

(52) $r_{e_2^2 x_1^2} = 0$,

then

$\sum x_1^2 e_2^2 = N \left[\dfrac{\sum x_1^2}{N}\right] \left[\dfrac{\sum e_2^2}{N}\right]$.

We see that the first term in brackets is the variance of $x_1$ and the second
is the variance of the error made in estimating $x_2$ from $x_1$. Thus
we may write

(53) $\sum x_1^2 e_2^2 = N s_1^2 s_2^2 (1 - r^2)$.

Substituting equations 18, 46, and 53 in equation 51, noting that $s_1 = s_2$,
and simplifying, we have

(54) $\sum x_1^2 x_2^2 = N s_1^4 (r^2 \beta' + 1 - r^2)$.

Let us now write the skewness index for the half test ($\alpha'$) as a function
of that for the total test ($\alpha_x$). From equation 13,

(55) $\sum x^3 = \sum (x_1 + x_2)^3$.

Using equation 35 and expanding equation 55, we have

(56) $N S_x^3 \alpha_x = \sum x_1^3 + 3\sum x_1^2 x_2 + 3\sum x_1 x_2^2 + \sum x_2^3$.

Substituting equations 15, 17, 41, and 43 in equation 56 and simplifying,
we have

(57) $S_x^3 \alpha_x = 2 s_1^3 \alpha' (1 + 3r)$.

Solving equation 20 explicitly for r, we have

(58) $r = \dfrac{R}{2 - R}$.

Solving equation 19 for $s_1$ and substituting from equation 58, we have

(59) $s_1^2 = \dfrac{S_x^2 (2 - R)}{4}$.
Substituting equations 58 and 59 in equation 57 and simplifying, we
have

(60) $\alpha_x = \alpha' \dfrac{(1 + R)\sqrt{2 - R}}{2}$,

from which we may write

(61) $\alpha' = \dfrac{2\alpha_x}{(1 + R)\sqrt{2 - R}}$.

To write the kurtosis index for the half test $\beta'$ as a function of the
corresponding index for the total test $\beta_x$, we use equation 13, and write

(62) $\sum x^4 = \sum (x_1 + x_2)^4$.

Using equation 36 and expanding equation 62, we have

(63) $N S_x^4 \beta_x = \sum x_1^4 + 4\sum x_1^3 x_2 + 6\sum x_1^2 x_2^2 + 4\sum x_1 x_2^3 + \sum x_2^4$.

Substituting equations 16, 18, 47, 49, and 54 in equation 63 and simplifying,
we have

(64) $S_x^4 \beta_x = 2 s_1^4 [\beta' (1 + 4r + 3r^2) + 3(1 - r^2)]$.

Substituting equations 58 and 59 in equation 64 and simplifying, we have

(65) $\beta_x = \beta' \left(\dfrac{1 + R}{2}\right) + \dfrac{3(1 - R)}{2}$.

Solving explicitly for $\beta'$ gives

(66) $\beta' = \dfrac{2\beta_x}{1 + R} - \dfrac{3(1 - R)}{1 + R}$.

We now have all the equations needed to solve for $\sum xy$ and $\sum x^2 y$ so
that equation 29 can be expressed entirely in terms of moments of the
total score distribution and the reliability of the test.
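The conversions between total-test and half-test quantities in equations 58, 59, 61, and 66 are easy to mistype, so a numerical check is useful. The following Python sketch is an illustration under the assumptions of this section (parallel halves); the function name and the sample values are hypothetical.

```python
from math import sqrt


def half_test_moments(R, S_x, alpha_x, beta_x):
    """Half-test r, s_1, alpha', beta' from total-test values.

    R is the corrected split-half reliability, S_x the total-score
    standard deviation, alpha_x the skewness, beta_x the kurtosis.
    """
    r = R / (2 - R)                                          # equation 58
    s1 = S_x * sqrt((2 - R) / 4)                             # equation 59
    alpha_half = 2 * alpha_x / ((1 + R) * sqrt(2 - R))       # equation 61
    beta_half = (2 * beta_x - 3 * (1 - R)) / (1 + R)         # equation 66
    return r, s1, alpha_half, beta_half


# Hypothetical total-test values.
R, S_x, alpha_x, beta_x = 0.8, 10.0, 0.5, 3.5
r, s1, a_half, b_half = half_test_moments(R, S_x, alpha_x, beta_x)

# Substituting back into equations 19, 57, and 64 should return
# the total-test variance, skewness term, and kurtosis term.
print(2 * s1 ** 2 * (1 + r), S_x ** 2)
print(2 * s1 ** 3 * a_half * (1 + 3 * r), S_x ** 3 * alpha_x)
print(2 * s1 ** 4 * (b_half * (1 + 4 * r + 3 * r ** 2) + 3 * (1 - r ** 2)),
      S_x ** 4 * beta_x)
```

Each printed pair should agree apart from rounding, which confirms that equations 60 and 65 are the inverses of 61 and 66.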


Multiplying equation 13 by 25, we have

(67) $\sum xy = \sum (x_1 + x_2)(x_1 - x_2)^2$,

which expands to give

(68) $\sum xy = \sum x_1^3 + \sum x_2^3 - \sum x_1^2 x_2 - \sum x_1 x_2^2$.

Substituting equations 15, 17, 41, and 43 in equation 68 and simplifying
gives

(69) $\sum xy = 2N s_1^3 \alpha' (1 - r)$.
Substituting equations 58, 59, and 61 in equation 69 and simplifying,
we have

(70) $\sum xy = N S_x^3 \alpha_x \left(\dfrac{1 - R}{1 + R}\right)$.

To express $\sum x^2 y$ in terms of moments of the total score distribution,
we use equations 13 and 25 to give

(71) $\sum x^2 y = \sum (x_1 + x_2)^2 (x_1 - x_2)^2$.

Expanding, we have

(72) $\sum x^2 y = \sum x_1^4 + \sum x_2^4 - 2\sum x_1^2 x_2^2$.

Substituting equations 16, 18, and 54 in equation 72 and simplifying,
we have

(73) $\sum x^2 y = 2N s_1^4 (\beta' - 1)(1 - r^2)$.

Substituting equations 58, 59, and 66 in equation 73 and simplifying,
we have

(74) $\sum x^2 y = N S_x^4 \left(\dfrac{1 - R}{1 + R}\right)(\beta_x - 2 + R)$.

Substituting equations 70, 74, and 30 to 34 in equations 29 and simplifying,
we have

(75)
$a = \dfrac{(1 - R)(\beta_x - 3 - \alpha_x^2)}{(1 + R)(\beta_x - 1 - \alpha_x^2)}$,
$b = \dfrac{(1 - R)\, 2 S_x \alpha_x}{(1 + R)(\beta_x - 1 - \alpha_x^2)}$,
$c = \dfrac{(1 - R) S_x^2 (\beta_x R - \alpha_x^2 R + 2 - R)}{(1 + R)(\beta_x - 1 - \alpha_x^2)}$.

Using these values for a, b, and c in equation 26, we have

(76) $\hat{y} = \dfrac{1 - R}{(1 + R)(\beta_x - 1 - \alpha_x^2)} \left[(\beta_x - 3 - \alpha_x^2)x^2 + 2 S_x \alpha_x x + (\beta_x R - \alpha_x^2 R + 2 - R) S_x^2\right]$.

When $\alpha_x$ is zero, that is, in a symmetrical distribution, we have

(77) $\hat{y} = \dfrac{1 - R}{(1 + R)(\beta_x - 1)} \left[(\beta_x - 3)x^2 + (\beta_x R + 2 - R) S_x^2\right]$.
When kurtosis is equal to that of the normal curve, that is, $\beta_x = 3$, we
have

(78) $\hat{y} = \dfrac{1 - R}{(1 + R)(2 - \alpha_x^2)} \left[-\alpha_x^2 x^2 + 2 S_x \alpha_x x + (2R - \alpha_x^2 R + 2) S_x^2\right]$.

For a symmetrical ($\alpha_x = 0$), mesokurtic ($\beta_x = 3$) distribution, we have

(79) $\hat{y} = S_x^2 (1 - R)$.


In this case the error of measurement is constant as test score varies. 

We see from these equations that for the case of zero skewness and 
kurtosis of 3, the average error of measurement is the same regardless 
of the score. However, for distributions that are positively or negatively 
skewed, or for a kurtosis greater or less than three, we should expect 
the error of measurement to vary with the magnitude of the test score. 
In addition to presenting the theoretical derivation given above, Mol- 
lenkopf (1948), (1949) has presented empirical verification to show that 
the error of measurement does vary in general in accordance with the 
indications of equation 76. 
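The behavior of equation 76 for different distribution shapes can be examined by evaluating its coefficients. The Python sketch below is an illustration only; the function names and the sample values of R, $S_x$, $\alpha_x$, and $\beta_x$ are assumptions introduced here, and x is a deviation score measured from the mean of the total test.

```python
def error_quadratic_coefficients(R, S_x, alpha_x, beta_x):
    """Coefficients a, b, c of equation 75 for y-hat = a x^2 + b x + c."""
    denom = (1 + R) * (beta_x - 1 - alpha_x ** 2)
    if denom == 0:
        raise ValueError("beta_x - 1 - alpha_x**2 must not be zero")
    k = (1 - R) / denom
    a = k * (beta_x - 3 - alpha_x ** 2)
    b = k * 2 * S_x * alpha_x
    c = k * (beta_x * R - alpha_x ** 2 * R + 2 - R) * S_x ** 2
    return a, b, c


def predicted_squared_error(x, R, S_x, alpha_x, beta_x):
    """Equation 76: expected squared error at deviation score x."""
    a, b, c = error_quadratic_coefficients(R, S_x, alpha_x, beta_x)
    return a * x * x + b * x + c


# Symmetrical, mesokurtic case: the quadratic reduces to equation 79.
print(error_quadratic_coefficients(0.9, 10.0, 0.0, 3.0))

# Symmetrical leptokurtic and platykurtic cases: the sign of a decides
# whether the error has a minimum or a maximum at the mean.
print(error_quadratic_coefficients(0.9, 10.0, 0.0, 4.0)[0])
print(error_quadratic_coefficients(0.9, 10.0, 0.0, 2.0)[0])
```

A positive a (leptokurtic case) means the error is smallest near the mean; a negative a (platykurtic case) means it is largest there. In every case $aS_x^2 + c$ equals $S_x^2(1 - R)$, so the average error agrees with equation 21.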


5. Summary 


The effect of group heterogeneity on test reliability has been derived
on the assumption that the error variance is the same for the two groups,
the entire difference in observed variance being attributed to a difference
in true variance of the two groups.

Solving explicitly for one of the standard deviations, we have

(4) $S_X = s_x \sqrt{\dfrac{1 - r_{xx}}{1 - R_{XX}}}$.

Solving explicitly for one of the reliability coefficients gives

(5) $R_{XX} = 1 - \dfrac{s_x^2}{S_X^2}(1 - r_{xx})$.

In these equations $s_x^2$ is the variance of group x on a given test,
$r_{xx}$ is the reliability of the same test for group x,
$S_X^2$ is the variance of group X on the same test, and
$R_{XX}$ is the reliability of the same test for group X.

A computing diagram for these equations is shown in Figure 2.
Mollenkopf (1948), (1949) has shown that, if we assume the test has
been divided into two parallel halves so that

(14) $s_1 = s_2$,
(15) $\alpha' = \alpha''$,
(16) $\beta' = \beta''$;

and if the errors of estimate are unrelated to the independent variable
so that

(39) $r_{e_2 x_1^2} = 0$,
(42) $r_{e_1 x_2^2} = 0$,
(45) $r_{e_2 x_1^3} = 0$,
(48) $r_{e_1 x_2^3} = 0$,
(52) $r_{e_2^2 x_1^2} = 0$;

then the best fitting quadratic to express the error of measurement as
a function of test score is

(76) $\hat{y} = \dfrac{1 - R}{(1 + R)(\beta_x - 1 - \alpha_x^2)} \left[(\beta_x - 3 - \alpha_x^2)x^2 + 2 S_x \alpha_x x + (\beta_x R - \alpha_x^2 R + 2 - R) S_x^2\right]$,

where R is the corrected parallel halves reliability,
$S_x$ is the standard deviation of the distribution of test scores,
$\alpha_x$ is the skewness of the distribution of test scores, and
$\beta_x$ is the kurtosis of the distribution of test scores.

According to this derivation the error of measurement is constant
with respect to variation in test score if and only if the test score
distribution has a skewness of zero and a kurtosis of three. The error of
measurement has a minimum for a leptokurtic and a maximum for a
platykurtic distribution of test scores.

It should be noted particularly that equation 76 follows from the
assumption that the two halves are parallel (that is, from equations 14,
15, and 16), and from the assumptions regarding the errors of estimate
(equations 39, 42, 45, 48, and 52). It also should be
noted that, if the conclusions of equation 76 do not apply in any given
case, it must follow that one or more of the foregoing assumptions
do not hold for that case. That is, either the halves used for computing
reliability were not parallel, the errors of estimate correlated with the
squares or cubes of the independent variable, or the squares of the
errors of estimate correlated with the squares of the independent
variable.

According to Mollenkopf's derivation, if the distribution of test scores 
has zero skewness and kurtosis of three, the error of measurement is 
invariant with respect to changes in magnitude of test score. The 
error of measurement has a minimum for a leptokurtic and a maximum 
for a platykurtic distribution of test scores.


Problems 


1. Assume a set of test scores each of which has been divided into comparable
halves for purposes of obtaining a split-half reliability. Designate these halves by
$x_a$ and $x_b$, and let $d = x_a - x_b$ (the difference between a person's score on part a and part b).
The total score on the test is the sum of the halves ($s = x_a + x_b$). Assume that the
halves are comparable so that their means and standard deviations are identical.

(a) Write $r_{x_a x_b}$ in terms of the standard deviations of s and d.

(b) Write the reliability of the total test in terms of the standard deviations of s
and d.

(c) Express the variance of d in terms of the reliability coefficient and the variance
of test scores.

(d) Assume that selection of cases occurs by rejecting persons with high or low
total scores, which would have the effect of changing the variance of s without
altering the variance of d. Write the new reliability coefficient in terms of the
old reliability coefficient and the two total score variances.

(e) Show that the standard deviation of d is the error of measurement of the test.


2. In one study of tests A, B, C, D, and E, the following results are obtained:

Test | Mean | Standard Deviation | Number of Items | Reliability
A    | 18.4 |  4.2               |  30             | .72
B    | 28.9 |  9.8               |  60             | .96
C    | 37.2 |  8.1               |  50             | .90
D    | 63.7 | 10.4               | 100             | .86
E    | 39.2 | 11.5               |  75             | .92

(a) Another investigator reports administering test A to a new group and finding
a mean of 25.3 and a standard deviation of 8.4. About what reliability would
you expect the test to have for this new group?

(b) It is reported that test B has been administered to a new group and the
reliability coefficient is only .90. What would account satisfactorily for this
lowered reliability without indicating any faults of test administration or
scoring?

(c) Test C is administered to a new group with the following results: mean 31.9,
standard deviation 12.7, reliability .96. Are these results in reasonable agreement
with those reported in the table for test C?

(d) Test D is reported as having a mean of 68.2, a standard deviation of 14.6, and
a reliability of .98. Are these results in reasonable agreement with those
reported in the table for test D?

(e) If the report on test D also stated that the reliability was based on a corrected
odd-even correlation and that the time allowance for the test had been changed,
would you infer that the time allowance had been increased or decreased?
(A brief survey of the chapters on experimental methods of determining reliability
and on speed versus power tests may help answer this question.)

(f) A teacher wishes to use test E for sectioning a class, and finds a mean score of
45.3 and a standard deviation of 3.9. What comment would you make on this
proposal?


3. Study the equation for estimating a change in reliability due to a change in
group variance given by Dickey (1934). Comment on this equation.


4. Write Davis' (1944) equations for the special case in which the "restricting
variable" is "true score."


11


Effect of Group Heterogeneity 
on Validity (Bivariate Case) 


1. Illustrations of selection 


In addition to affecting the reliability coefficient of a test, the hetero- 
geneity of the group tested will also affect the validity coefficient. For 
example, if in Figure 1 the abscissa represents a test and the ordinate a 
criterion, the validity coefficient for the total group will be much greater 
than the validity coefficient for the restricted portion of the group 
included between the two dotted lines. The validity coefficient would 
be lowered in a similar manner if the selection were made upon the basis 
of the criterion variable. 

It should be noted that here again we are assuming that the change in 
variability is due mainly to a change in true variance. The actual per- 
sons at the upper or the lower end of the scale are removed, which 
means that the true variance is lowered. In this section we shall not 
consider the case in which there are changes in observed variance due 
to changes in error variance. As pointed out in Chapter 10, such an 
assumption is quite unreasonable. 

In considering the effect of selection of cases upon the intercorrelation 
between two tests it is important to note that this effect will vary with 
the nature of the selection procedure. In any practical situation the 
actual selection procedures are usually complex, and to a great degree 
unknown. We can only investigate the situation and make the most 
reasonable guesses possible regarding the selection procedure operative 
in any particular instance. 

For example, if an intelligence test is given to all applicants for admis- 
sion to a college, and only those with a score greater than 0.5 sigma are 
accepted, we have a clear case of selection on the basis of the test. Simi- 
larly, if a business concern uses a selection test and accepts only the 
upper 60 per cent, we have a clear-cut case of selection on the basis of 
the test. Usually such clear cases do not occur. The college admits all 
students with a score over 0.5 sigma, provided they do not have poor 
grades or a bad recommendation from their high school principal. 

Likewise, the college may reject all applicants with a score of less than
0.5 sigma unless they have an exceptionally good high school record
and excellent recommendations. In most if not all practical situations
it is impossible to find out just what weighted combination of the available
variables was used for selection. In many cases, however, it is
clear that a given selection test was one of the major items in the selection
procedure so that the results found by assuming that selection was
solely on the basis of the test will not be far from the correct estimate.

Figure 1. A diagrammatic illustration of the change in correlation with a change
in standard deviation. (Axes: Score X on the abscissa, Score Y on the ordinate.)

There may also be times when the criterion itself is the selective 
device. For example, if we are using a given test to predict college 
grades, it may be that students with a grade less than C or less than D 
are dropped. Then selection is clearly on the basis of the criterion.
Likewise if a manufacturing concern wishes to develop a test that will 
predict a given production record, and dismisses employees if the pro- 
duction record falls below a given minimum standard, we have a clear 
case of selection on the basis of the criterion. 

In any practical situation there are usually several selection devices
at work. This fact may suggest that, if we considered the case of selection
on the basis of two or more variables (multivariate selection), we
should have equations that would be more applicable to practical situations.
Careful investigation of selection procedures shows, however,
that there are numerous extenuating circumstances that sometimes 
override the strict selection rules. In other words, the equations given 
to “correct for selection," whether univariate or multivariate selection, 
are only approximations to the practical situation. The equation that 
will give the closest approximation to the selection facts of a particular 
situation is the one to use. 

In short, there is no completely satisfactory substitute for a well 
set-up experiment, in which a number of selection tests are given to a 
group. All of the group without any selection are then admitted to the 
training program and given the criterion measure under identical condi- 
tions. If these conditions are observed, the necessity of “correcting for 
homogeneity” is completely avoided, and we can make very simple and 
straightforward comparisons of the relative merit of different selection 
procedures. However, many practical situations arise in which it is 
not possible to have such complete control over the experimental condi- 
tions. If we are dealing with such situations and wish, for example, to 
compare the validities of two tests, one of which was used for selection 
and the other was given after selection, the simple zero-order validity 
coefficients are definitely misleading. It is necessary either to make a
correction for the results of selection, taking care to select the equations 
that are most nearly appropriate for the case in hand or to use some 
index, such as the error of estimate, that is not affected by the selection 
procedure. 

In the material that follows we shall consider only the case of univariate
selection.¹ We shall consider first the bivariate case, in which
we are interested in the correlation between two variables (X and Y)
and selection has been on the basis of one of these variables. Next we
shall consider the trivariate case, where we are interested in the correlation
between Y and Z, when selection has been on the basis of a third
variable (X).

¹ Multivariate selection is discussed in Chapter 13.


2. The distinction between explicit and incidental selection

Considerable confusion and error have been caused by the failure to
distinguish carefully between two types of selection. We have first the
direct selection of cases on the basis of a given variable. Those who are
above the critical score are admitted, and those below the critical score
are rejected. This selection is referred to as explicit selection. Second,
we have an indirect selection effect on one variable brought about by
explicit selection on another correlated variable. For example, if a
given college rejects all applicants who are below the 50 percentile
score on the ACE Psychological Examination, the result will be the 
selection of a group of persons who would score high on the Ohio Psycho- 
logical Examination, even though the Ohio Psychological is never given 
to that group. This occurs because scores on the two examinations are 
positively correlated. We shall refer to the selection effect on the Ohio 
Psychological in this case as incidental selection. Explicit selection on 
a given examination results in incidental selection on all tests correlated 
with that examination. 

In order to avoid the confusion that has appeared in much of the 
literature on correction for the effects of selection, we shall treat the two 
cases separately. First, however, we shall consider the basic assump- 
tions common to both types of selection for the bivariate case. 


3. Basic assumptions for the bivariate case 

Figure 1 illustrates this case. Our problem is to find what parameters
are invariant from the curtailed to the extended distribution. These
parameters can then be used to bridge the gap from one distribution to
the other. If selection is on the basis of the x variable, the regression
line of y on x will not be systematically affected, and can be assumed to
be the same for the curtailed and extended distribution. In Figure 1
we see that the mean y for a given x is not altered by explicit selection
on x. Since the regression of y on x is the line through these means, we
see that, if the regression is perfectly linear, the assumption will hold
exactly. Also, from inspecting the diagram, we see that explicit selection
on x will markedly alter the mean x for a given y and hence will alter
the regression of x on y. If we designate the curtailed group by x and y
and the extended group by X and Y, the foregoing assumption may be
written by putting down the equation of the regression line of y on x
for each group:

(1) $\hat{y} = r_{xy} \dfrac{s_y}{s_x} x$,

(2) $\hat{Y} = R_{XY} \dfrac{S_Y}{S_X} X$.

Since it is assumed that the predicted or average y (Y) for a given x (X)
is the same in both cases, the slopes of the two regression lines are equal,
and we may write

(3) $r_{xy} \left(\dfrac{s_y}{s_x}\right) = R_{XY} \left(\dfrac{S_Y}{S_X}\right)$.
From inspection of Figure 1, we see that the dispersion about the regression
line of y (Y) on x (X) will not be affected. That is to say, not
only is the mean y for a given x the same, but the dispersion of y's for a
given x is the same in both groups.¹ A little geometric consideration
will show that, if the selection is explicitly on x as shown in the diagram,
the dispersion of the x's for a given y cannot remain the same for all
values of y. In fact for each and every value of y (Y) in the middle
range, the dispersion of x for the curtailed group is much less than the
dispersion for the complete group. From the foregoing considerations,
we see that, when there is explicit selection on x, the error made in estimating
y from x is the same for both the complete and the curtailed
group. We may thus write the expressions for the two errors of estimate
and set them equal to each other as follows:

(4) $s_{y \cdot x} = s_y \sqrt{1 - r_{xy}^2}$,

(5) $S_{Y \cdot X} = S_Y \sqrt{1 - R_{XY}^2}$,

(6) $s_y \sqrt{1 - r_{xy}^2} = S_Y \sqrt{1 - R_{XY}^2}$.
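The two invariance assumptions of this section (equal slopes, equation 3, and equal errors of estimate, equation 6) can be illustrated by a small simulation. The Python sketch below is an assumption-laden illustration, not part of the original argument: it draws a bivariate normal sample, applies explicit selection on x at a hypothetical cutoff of 0.5 sigma, and compares the two groups. All names, the sample size, the correlation of .6, and the seed are choices made here.

```python
import random
from math import sqrt


def summarize(pairs):
    """Correlation, slope of y on x, and error of estimate for (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs) / n
    var_y = sum((y - mean_y) ** 2 for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / n
    r = cov / sqrt(var_x * var_y)
    return {
        "r": r,
        "slope": cov / var_x,                        # r * s_y / s_x
        "error_of_estimate": sqrt(var_y * (1 - r * r)),
        "sd_x": sqrt(var_x),
        "sd_y": sqrt(var_y),
    }


def simulate_explicit_selection(n=20000, rho=0.6, cutoff=0.5, seed=1):
    """Return summaries for the extended group and the group with x > cutoff."""
    rng = random.Random(seed)
    full = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        # Linear, homoscedastic regression of y on x.
        y = rho * x + sqrt(1 - rho ** 2) * rng.gauss(0.0, 1.0)
        full.append((x, y))
    curtailed = [(x, y) for x, y in full if x > cutoff]
    return summarize(full), summarize(curtailed)


extended, curtailed = simulate_explicit_selection()
print(extended)
print(curtailed)
```

In such a run the slope and the error of estimate should be nearly the same in the two groups, while the correlation and both standard deviations are smaller in the curtailed group.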


4. Variance known for both groups on variable subject to 
incidental selection 


In the usual case of correction for restriction of range, we have complete
information on one group (usually the curtailed group) and we
have, or can estimate to a reasonable accuracy, one of the variances for
the other group (usually the more heterogeneous group). That is to
say, in the typical case we have values for $r_{xy}$, $s_y$, $s_x$, and one standard
deviation for the other group, either $S_X$ or $S_Y$. Unless we know these
four values, the problem cannot be solved.

Let us use the subscript x to designate the variable subjected to
explicit selection and y to designate the variable subjected to incidental
selection, as indicated in Figure 1. First we shall consider the case in
which both variances are known for y, the variable subject to incidental
selection. Here we have given $r_{xy}$, $s_x$, $s_y$, and $S_Y$. The problem is to
express $S_X$ and $R_{XY}$ in terms of these four given values.

The solution for $R_{XY}$ can be obtained directly from equation 6.
Squaring both sides and dividing by $S_Y^2$ gives

(7) $1 - R_{XY}^2 = (1 - r_{xy}^2) \dfrac{s_y^2}{S_Y^2}$.

Solving equation 7 explicitly for $R_{XY}$ gives the final result

(8) $R_{XY} = \sqrt{1 - \dfrac{s_y^2}{S_Y^2}(1 - r_{xy}^2)}$,

where $r_{xy}$ is the correlation between x and y for one group,
$s_y$ is the standard deviation of the variable subjected to incidental
selection for the same group, and
$S_Y$ is the standard deviation of the same variable for the other
group, that is, the group for which an estimate of the correlation
($R_{XY}$) is desired.

¹ This assumption follows from the usual assumption of homoscedasticity. This is
the assumption that the dispersion of y for a given x is the same regardless of the
value of x, and is basic to many of the theorems of statistics.


If the variance of a variable subject to incidental selection is
known for two groups ($s_y^2$ and $S_Y^2$ are known), and the correlation
between the incidental and explicit selection variables
is known for one group ($r_{xy}$ is known), equation 8 should be
used to estimate the correlation between these two variables for
the second group.
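Equation 8 is simple enough to compute directly. The following Python sketch is an illustration with a hypothetical function name; it takes the known correlation and the ratio $S_Y/s_y$, and it raises an error when the supplied values are inconsistent with the assumptions (which can happen when estimating for a more restricted group).

```python
from math import sqrt


def correlation_for_other_group(r_xy, sd_ratio):
    """Equation 8: R_XY = sqrt(1 - (s_y / S_Y)**2 * (1 - r_xy**2)).

    sd_ratio is S_Y / s_y, the standard deviation of the incidental
    selection variable in the target group divided by that in the group
    for which r_xy is known.
    """
    if sd_ratio <= 0:
        raise ValueError("sd_ratio must be positive")
    value = 1.0 - (1.0 - r_xy ** 2) / sd_ratio ** 2
    if value < 0:
        raise ValueError("values are inconsistent with equation 8")
    return sqrt(value)


# Estimating upward to a group whose standard deviation is twice as large.
print(correlation_for_other_group(0.45, 2.0))
# A ratio of 1.0 leaves the correlation unchanged.
print(correlation_for_other_group(0.45, 1.0))
```

A ratio below 1.0 gives the estimate for a more restricted group, as in the example of the second university discussed in the text.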


Equation 8 or slight variants of it has been presented by Kelley 
(1923c), Garrett (1947), Guilford (1942), Crawford and Burnham 
(1946), Thorndike (1947), and others. 

It should be noted that nowhere in the previous derivations has it 
been assumed that Sy was greater than sy. In a broader sense the 
lower-case subscripts (x and y), which were originally assigned to the
“curtailed group” as shown in Figure 1, may be taken to designate the 
group for which complete information is available. Then the upper-case 
subscripts (X and Y) designate the group for which only one standard 
deviation is available and for which additional information is sought. 
Usually we shall have complete information on the group with the 
smaller variance and shall wish to estimate the correlation for the un- 
restricted group, the one with the larger variance. The equations, 
however, are equally applicable if we have complete information for 
the unrestricted group and wish to estimate the correlation for various 
sorts of restriction. For example, we may know the validity of a given 
test in an unrestricted group and may wish to estimate the validity of 
that same test for use in a second university that has higher entrance 
standards, and hence gets a group of students with a larger mean and 
smaller standard deviation.

A computing diagram for equation 8 is shown in Figure 2. This diagram
is set up in terms of equation 7. It shows the value of the ratio
of the standard deviations ($S_Y/s_y$) and the values of the two correlation
coefficients. In order to use this diagram, we find the diagonal line
corresponding to the standard deviation of Y divided by the standard
deviation of y; then we locate the correlation ($r_{xy}$) at the bottom of the
diagram, follow up to the appropriate diagonal line, and then to the
right to read the value of $R_{XY}$. For example, if the ratio $S_Y/s_y$ is 1.2,
and $r_{xy}$ is .45, the expected value of $R_{XY}$ is .64. It is also possible to
use this chart to find the ratio of standard deviations needed to account
for a given difference in the correlation between x (X) and y (Y). Locate
the value of the smaller correlation on the abscissa, and the larger
correlation on the ordinate, and note the diagonal line that corresponds to
the intersection of these two lines.

FIGURE 2. Showing the change in correlation as a function of change in variance
of the incidental selection variable. (Abscissa: $r_{xy}$; ordinate: $R_{XY}$;
diagonal lines: $S_Y/s_y$; plotted from $1 - R_{XY}^2 = (1 - r_{xy}^2)(s_y^2/S_Y^2)$.)

In constructing Figure 2, we measure the ordinate and the abscissa
in terms of $1 - r_{xy}^2$ and $1 - R_{XY}^2$ (from the upper right-hand corner).
However, the ordinate (on the right) and the base line are labeled in
terms of $r_{xy}$ and $R_{XY}$. The diagonal lines have been drawn to correspond
to the variance ratio but have been labeled in terms of the ratio of
standard deviations.

In addition to estimating the correlation for the XY group, it is 
sometimes also desirable to estimate the standard deviation of the 
variable subjected to explicit selection. In our present notation, this is 
the standard deviation of X. It can be estimated readily by solving 
equation 3 for $S_X$, obtaining

(9)    $S_X = \dfrac{R_{XY} S_Y s_x}{r_{xy} s_y}$.


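If we square the foregoing equation and substitute the value of $R_{XY}$
from equation 8, we have

(10)    $S_X^2 = \dfrac{S_Y^2 s_x^2}{r_{xy}^2 s_y^2}\left[1 - \dfrac{s_y^2}{S_Y^2}\left(1 - r_{xy}^2\right)\right]$,

which simplifies to

(11)    $S_X^2 = s_x^2\left[1 - \dfrac{1}{r_{xy}^2} + \dfrac{1}{r_{xy}^2}\,\dfrac{S_Y^2}{s_y^2}\right]$.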


The terms in the brackets may be combined and the square root 
taken, giving 


(12)    $S_X = \dfrac{s_x\sqrt{S_Y^2 - s_y^2 + r_{xy}^2 s_y^2}}{r_{xy} s_y}$,


where $s_x$ is the standard deviation of the variable subjected to explicit
selection in the group for which complete information is
available,
$S_X$ is the estimate of the standard deviation of the same variable
in the group for which only $S_Y$ is available, and the other
variables have the same definitions as for equation 8.


If complete information ($s_x$, $s_y$, and $r_{xy}$) is available for one
group, and only the variance of Y (the variable subject to
incidental selection) is known for a second group, equation 12
should be used to estimate the variance of X (the explicit
selection variable) in the second group.
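
As an illustration of equation 12, the sketch below (Python, with a hypothetical function name and hypothetical values) computes $S_X$ and prints both sides of the equal-slope assumption, $r_{xy} s_y/s_x$ and $R_{XY} S_Y/S_X$, with $R_{XY}$ taken from equation 8. It is offered only as a check on the algebra, not as a prescribed computing routine.

```python
import math


def estimate_explicit_sd(s_x, s_y, S_Y, r_xy):
    """Equation 12: estimate S_X when only S_Y is known for the second group."""
    radicand = S_Y ** 2 - s_y ** 2 + r_xy ** 2 * s_y ** 2
    if radicand < 0.0:
        raise ValueError("inputs are inconsistent with the selection assumptions")
    return s_x * math.sqrt(radicand) / (r_xy * s_y)


# Hypothetical first group: s_x = 5, s_y = 10, r_xy = .50; second group: S_Y = 12.
s_x, s_y, S_Y, r_xy = 5.0, 10.0, 12.0, 0.50
S_X = estimate_explicit_sd(s_x, s_y, S_Y, r_xy)

# Equation 8 for the second group's correlation, then the two regression slopes.
R_XY = math.sqrt(1.0 - (s_y ** 2 / S_Y ** 2) * (1.0 - r_xy ** 2))
print(r_xy * s_y / s_x, R_XY * S_Y / S_X)  # the two slopes should agree
```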


5. Variance of both groups known for variable subject to 
explicit selection 
Let us turn now to the situation where we know the variance for both
groups for the explicit selection variable. This case is much more
common than the one previously discussed. For example, if we give a
selection test to a group of applicants, use this test score to select the upper
K per cent of applicants, and then admit this upper K per cent to
college or to an industry so that performance records can be secured for
the upper K per cent, we should then have the variance of the selection
test for both the applicant and the selected group. That is to say, the
variance of both groups would be known for the explicit selection
variable. This case occurs frequently so that the equations developed in
this section will be the ones that are most generally useful in estimating 
the effects of selection on validity coefficients. 

As before, we shall use x or X to represent the explicit selection
variable, and y or Y to represent the incidental selection variable, in
accordance with the symbols of Figure 1. In the previous notation we
have values for $r_{xy}$, $s_x$, $s_y$, and $S_X$. The problem is to solve for $R_{XY}$
and $S_Y$ in terms of these four known values.

As before, it must be remembered that we are not assuming that $s_x$ is
smaller than $S_X$. The equations developed will apply when $s_x$ is smaller
than $S_X$ and will also apply when $s_x$ is larger than $S_X$. In the notation
used here, the lower-case subscripts designate the group for which we
have complete information (two variances and the correlation), whereas
the upper-case subscript designates the group for which we know only
the variance of the explicit selection variable.

In order to obtain the equation for $R_{XY}$, we may first solve equation 3
for $S_Y$, obtaining

(13)    $S_Y = \dfrac{r_{xy} s_y S_X}{R_{XY} s_x}$,

and then substitute this value for $S_Y$ in equation 6, obtaining

(14)    $s_y\sqrt{1 - r_{xy}^2} = \dfrac{r_{xy} s_y S_X}{R_{XY} s_x}\sqrt{1 - R_{XY}^2}$.

Dividing both sides by $s_y$, squaring both sides, and segregating $R_{XY}$ on
one side of the equation gives

(15)    $\dfrac{1 - R_{XY}^2}{R_{XY}^2} = \dfrac{s_x^2\left(1 - r_{xy}^2\right)}{S_X^2 r_{xy}^2}$.

The simplest way to graph this function is to divide both numerator
and denominator on the right side by $r_{xy}^2$, obtaining

(16)    $\dfrac{1}{R_{XY}^2} - 1 = \dfrac{s_x^2}{S_X^2}\left(\dfrac{1}{r_{xy}^2} - 1\right)$.

Thus we may express the two standard deviations as a ratio and graph
the function as indicated in Figure 3. Solving explicitly for $1/R_{XY}^2$ gives

(17)    $\dfrac{1}{R_{XY}^2} = 1 + \dfrac{s_x^2}{S_X^2}\left(\dfrac{1}{r_{xy}^2} - 1\right)$.

FIGURE 3. Showing the change in correlation as a function of the change in variance
of the explicit selection variable.

Putting the right-hand side over the common denominator $S_X^2 r_{xy}^2$,
inverting, and taking the square root gives

(18)    $R_{XY} = \dfrac{S_X r_{xy}}{\sqrt{S_X^2 r_{xy}^2 + s_x^2 - s_x^2 r_{xy}^2}}$,



where the terms have the same definitions as for equations 8 and 12.

If complete information ($s_x$, $s_y$, and $r_{xy}$) is available for one
group, and only the variance of X (the variable subject to
explicit selection) is known for a second group, equation 18
should be used to estimate the correlation between X and Y
for the second group ($R_{XY}$).
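
Equation 18 is the form most often needed in practice, so a small sketch may be useful. The Python function below is illustrative only; its name and the numerical values are hypothetical and do not come from the text.

```python
import math


def estimate_r_from_explicit_sd(r_xy, s_x, S_X):
    """Equation 18: estimate R_XY when the SD of the explicit selection
    variable is known for both groups."""
    denominator = math.sqrt(
        S_X ** 2 * r_xy ** 2 + s_x ** 2 - s_x ** 2 * r_xy ** 2
    )
    return S_X * r_xy / denominator


# Hypothetical values: r_xy = .40 in a selected group with s_x = 4;
# the applicant group has S_X = 8.
r_selected = 0.40
r_applicants = estimate_r_from_explicit_sd(r_selected, s_x=4.0, S_X=8.0)
print(round(r_applicants, 3))

# The same function runs in the other direction (extended to restricted range).
print(round(estimate_r_from_explicit_sd(r_applicants, s_x=8.0, S_X=4.0), 3))
```

The second call recovers the starting correlation, which illustrates the point made above that the equations do not require the lower-case group to be the restricted one.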


Equation 18, or slight variants of it, was first derived by Pearson 
(1903a). It has also been presented by Kelley (1923c), Holzinger (1928),
Thurstone (1931a), Thorndike (1947), Crawford and Burnham (1932), 
and others. 

Sometimes it is also desirable to estimate the standard deviation, for
the second group, of the variable subject to incidental selection (the
value of $S_Y$). It may be noted that the value of $S_Y$ is given by equation
13, except for the fact that this equation contains the term $R_{XY}$, which
is not known. However, the value of $R_{XY}$ is given by equation 18.
Substituting equation 18 in equation 13 gives

(19)    $S_Y = \dfrac{r_{xy} s_y S_X}{s_x S_X r_{xy}}\sqrt{S_X^2 r_{xy}^2 + s_x^2 - s_x^2 r_{xy}^2}$,

which simplifies to

(20)    $S_Y = s_y\sqrt{1 - r_{xy}^2 + r_{xy}^2\left(S_X^2/s_x^2\right)}$,

where the terms have the same definitions as for equations 8 and 12.


If complete information ($s_x$, $s_y$, and $r_{xy}$) is available for one
group, and only the variance of X (the variable subject to
explicit selection) is known for a second group, equation 20
should be used to estimate the variance of Y (the incidental
selection variable) for the second group.
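
A short sketch of equation 20 follows (Python; the function name and values are hypothetical). It also prints the error of estimate, $s_y\sqrt{1 - r_{xy}^2}$, for both groups, since equation 6 assumes this quantity is unchanged by selection on x.

```python
import math


def estimate_incidental_sd(s_y, r_xy, s_x, S_X):
    """Equation 20: estimate S_Y when only S_X is known for the second group."""
    return s_y * math.sqrt(1.0 - r_xy ** 2 + r_xy ** 2 * (S_X ** 2 / s_x ** 2))


# Hypothetical first group: s_x = 4, s_y = 10, r_xy = .40; second group: S_X = 8.
s_x, s_y, r_xy, S_X = 4.0, 10.0, 0.40, 8.0
S_Y = estimate_incidental_sd(s_y, r_xy, s_x, S_X)

# Equation 18 for the second group's correlation.
R_XY = S_X * r_xy / math.sqrt(S_X ** 2 * r_xy ** 2 + s_x ** 2 * (1.0 - r_xy ** 2))

# Equation 6: the error of estimate should be the same in both groups.
print(s_y * math.sqrt(1.0 - r_xy ** 2), S_Y * math.sqrt(1.0 - R_XY ** 2))
```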


6. Comparison of variance change for explicit and incidental 
selection 
In order to compare the change in variability of the variable on which 
there is explicit selection with the change in variability of the variable 


on which there is incidental selection, we can rewrite equation 20 as
follows:

(21)    $\left(\dfrac{S_Y^2}{s_y^2} - 1\right) = r_{xy}^2\left(\dfrac{S_X^2}{s_x^2} - 1\right)$.

The percentage of change in variance of the variable subject to
incidental selection is equal to $r_{xy}^2$ times the percentage change
in the variance of the variable subjected to explicit selection.

It should be noted that, if both $S_Y$ and $S_X$ are available, it is possible
to check by means of equation 21 to see if the proper relationship holds. 
If this relationship does not hold, the selection probably was not entirely 
and consistently made on variable x. 
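
The check described above can be sketched in a few lines of Python. The function names are hypothetical, and the tolerance used to decide whether the relationship "holds" is an arbitrary assumption of this sketch, not a value given in the text. The first call uses the values of the Figure 4 illustration in this section (variance ratio 1.6, correlation .90).

```python
def incidental_variance_ratio(explicit_variance_ratio, r_xy):
    """Equation 21 solved for S_Y^2 / s_y^2."""
    return 1.0 + r_xy ** 2 * (explicit_variance_ratio - 1.0)


def selection_looks_consistent(s_x, S_X, s_y, S_Y, r_xy, tolerance=0.05):
    """Compare the observed variance ratio of y with the one implied by equation 21.

    The tolerance is a hypothetical cutoff chosen only for illustration.
    """
    expected = incidental_variance_ratio(S_X ** 2 / s_x ** 2, r_xy)
    observed = S_Y ** 2 / s_y ** 2
    return abs(observed - expected) <= tolerance


print(round(incidental_variance_ratio(1.6, 0.90), 2))  # 1.49
print(selection_looks_consistent(4.0, 8.0, 10.0, 12.17, 0.40))
```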


FIGURE 4. Computing diagram showing the relationship between the change in
variance ratios for the explicit selection variable (designated by X and x) and the
incidental selection variable (designated by Y and y).


Figure 4 gives the relationships of equation 21. From this graph it
is possible to determine the variance ratios of the explicit and incidental
selection variables that correspond to any given correlation $r_{xy}$. The
diagram indicates, for example, that, if the variance ratio for the explicit
selection variable is 1.6 and the correlation $r_{xy}$ is .90, the variance ratio
for the incidental selection variable is 1.49.

It should be noted that there is no need to rework the foregoing
derivations for the case where variable y was subjected to explicit
selection. Simply let x (X) in the foregoing equations stand for the variable
on which explicit selection occurred. It may be either the criterion or
the test. Then note whether the variance of the complete group is
known for the variable on which explicit selection occurred ($S_X^2$), or
for the variable on which incidental selection occurred ($S_Y^2$), and use
the equations appropriate to the information available.


7. Relationship between reliability and validity for incidental 
selection 


In Chapter 9 we saw that the ratio of the validity coefficient to the 
index of reliability does not change with increases (or decreases) in test 
length (see equation 13 of Chapter 9). Similarly, when considering the 
effect of changes in group heterogeneity, it is possible to find a relation- 
ship between validity and reliability that does not change as the het- 
erogeneity of the group changes. 

First let us consider this effect for the variable subject to incidental 
selection. As before, we shall use y to designate this variable and x to 
designate the variable subject to explicit selection. As noted in
equation 6, when y is subject to incidental and x to explicit selection, the
quantity $s_y\sqrt{1 - r_{xy}^2}$ is invariant with respect to explicit selection on
variable x. As indicated in Chapter 10, equation 3, the quantity
$s_y\sqrt{1 - r_{yy}}$ (the error of measurement) for y does not change as the
heterogeneity of the group changes.

Since each of these quantities is invariant with respect to changes in
group heterogeneity, their ratio is also invariant. Dividing one
quantity by the other, canceling the term $s_y$, and squaring the remaining
fraction, we have

(22)    $C = \dfrac{1 - r_{xy}^2}{1 - r_{yy}}$,

where C is arbitrarily used to designate this constant,
$r_{yy}$ is the reliability of the test y, and
$r_{xy}$ is the correlation between y and the explicit selection
variable (x).


If x is subject to explicit and y to incidental selection, then
for any type of explicit selection on x the ratio
$\dfrac{1 - r_{xy}^2}{1 - r_{yy}}$ is constant.


8. Relationship between reliability and validity for explicit
selection
In order to obtain a relationship between $r_{xx}$ (the reliability of x) and
$r_{xy}$ (the correlation between x and y) that does not change with explicit
selection on x, we make use of the following assumptions:


1. The error of measurement for x is invariant with respect to explicit
selection on x (equation 3, Chapter 10),

    $s_x\sqrt{1 - r_{xx}} = C_1$.

2. The error made in estimating y from x does not change (equation 6),

    $s_y\sqrt{1 - r_{xy}^2} = C_2$.

3. The slope of the regression of y on x does not change (equation 3),

    $\dfrac{s_y}{s_x}\, r_{xy} = C_3$.


These three equations can readily be combined so as to eliminate the
standard deviations of x and of y. The first equation multiplied by the
third and divided by the second gives an expression in which the
standard deviations cancel out, leaving a constant that we may designate
as $C'$:

    $C' = \dfrac{r_{xy}\sqrt{1 - r_{xx}}}{\sqrt{1 - r_{xy}^2}}$,

where $r_{xx}$ is the reliability of the variable subject to explicit selection, and
$r_{xy}$ is the correlation of this variable with any other variable.


If x is subject to explicit and y to incidental selection, then for
any type of explicit selection on x the quantity
$\dfrac{r_{xy}\sqrt{1 - r_{xx}}}{\sqrt{1 - r_{xy}^2}}$
is constant.
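
The constancy of $C'$ can be illustrated numerically. In the Python sketch below, which is not part of the original text, the second group's validity is taken from equation 18 and its reliability from assumption 1 written as $s_x^2(1 - r_{xx}) = S_X^2(1 - R_{XX})$; the function names and all numerical values are hypothetical.

```python
import math


def c_prime(r_xy, r_xx):
    """The quantity r_xy * sqrt(1 - r_xx) / sqrt(1 - r_xy^2)."""
    return r_xy * math.sqrt(1.0 - r_xx) / math.sqrt(1.0 - r_xy ** 2)


def second_group_values(r_xy, r_xx, s_x, S_X):
    """Validity (equation 18) and reliability (assumption 1) for the second group."""
    R_XY = S_X * r_xy / math.sqrt(S_X ** 2 * r_xy ** 2 + s_x ** 2 * (1.0 - r_xy ** 2))
    R_XX = 1.0 - (s_x ** 2 / S_X ** 2) * (1.0 - r_xx)
    return R_XY, R_XX


# Hypothetical first group: r_xy = .40, r_xx = .80, s_x = 4; second group: S_X = 8.
R_XY, R_XX = second_group_values(0.40, 0.80, s_x=4.0, S_X=8.0)
print(c_prime(0.40, 0.80), c_prime(R_XY, R_XX))  # the two values should agree
```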


9. Summary 

The variable directly used for selection has been termed the explicit 
selection variable and designated by x or X. The correlated variable, 
which is affected only indirectly because of its relationship with the 
explicit selection variable, has been termed the incidental selection 
variable and designated by y or Y. 

The basic assumptions for the bivariate case are first that the slope
of the regression of y on x is equal to the slope of the regression of Y
on X, which is given in equation

(3)    $r_{xy}\dfrac{s_y}{s_x} = R_{XY}\dfrac{S_Y}{S_X}$,

and second that the error made in estimating y from x is the same as
that made in estimating Y from X, which is given in equation

(6)    $s_y\sqrt{1 - r_{xy}^2} = S_Y\sqrt{1 - R_{XY}^2}$.


There is no need to specify that one group is the extended and the 
other the curtailed group. The same equations apply either for estima- 
tion from a restricted to an extended range or from an extended to a 
restricted range. The convention was adopted that the lower-case
letters should stand for the group on which complete information was
available. In other words, it was assumed that $r_{xy}$, $s_x$, and $s_y$ were
known. We then have two types of cases.

First we considered the case in which the variance of both groups is
known for the incidental selection variable ($S_Y$ is known). For this
case, we have

(8)    $R_{XY} = \sqrt{1 - \dfrac{s_y^2}{S_Y^2}\left(1 - r_{xy}^2\right)}$

and

(12)    $S_X = \dfrac{s_x\sqrt{S_Y^2 - s_y^2 + r_{xy}^2 s_y^2}}{r_{xy} s_y}$.

Second we considered the case in which the variance of both groups
is known for the explicit selection variable ($S_X$ is known). For this case
we have

(18)    $R_{XY} = \dfrac{S_X r_{xy}}{\sqrt{S_X^2 r_{xy}^2 + s_x^2 - s_x^2 r_{xy}^2}}$

and

(20)    $S_Y = s_y\sqrt{1 - r_{xy}^2 + r_{xy}^2\left(S_X^2/s_x^2\right)}$.

Two computing diagrams were presented for these selection equations,
Figure 2 for equation 8 and Figure 3 for equation 18.

In order to demonstrate that the effect on the standard deviation of 
the explicit selection variable (x or X) was greater than the effect on 


the standard deviation of the incidental selection variable, this
relationship was shown in

(21)    $\left(\dfrac{S_Y^2}{s_y^2} - 1\right) = r_{xy}^2\left(\dfrac{S_X^2}{s_x^2} - 1\right)$.


Figure 4 was presented to illustrate this equation.

Just as there is a relationship between validity and reliability of a
test that is maintained as the length of the test is varied (see equation 13,
Chapter 9), there is a different relationship between reliability and
validity as the heterogeneity of the group is varied. The relationship
between the reliability of the incidental selection variable and its
validity for predicting the explicit selection variable is given by

(22)    $\dfrac{1 - r_{xy}^2}{1 - r_{yy}}$ equals a constant.

The relationship between the reliability of the explicit selection variable
and its validity for predicting the incidental selection variable is given by

    $\dfrac{r_{xy}\sqrt{1 - r_{xx}}}{\sqrt{1 - r_{xy}^2}}$ equals a constant.


Problems



Assume that a population is selected on the basis of test scores in such a way as to 
change the variance of test scores. 

1. Write the equation for estimating reliability in the new population as a function 
of test variance in the new population, and the test variance and reliability for the 
old population. 

2. Write the corresponding equation showing the relationship between the test 
variances and validities for both populations. (The criterion is subject only to inci- 
dental selection.) 

3. Write the equation showing the relationship between test reliability and test 
validity for a given test, as the population is subject to selection on the basis of test 
score. (Suggestion: Eliminate variances in the two preceding equations.) 

4. Compare this relationship with the expected relationship between reliability 
and validity when test length is varied for a fixed population.


Data for Problems 5-10

Test        Mean     Standard    Number     Reliability   Correlation of Each Test
                     Deviation   of Items                 with the Same Criterion

A           19.3       4.1         30          .74           .69
B           10.2       4.2         20          .87           .70
C           58.3      13.8        100          .92           [illegible]
D           27.8       9.7         50          .95           .74
E           68.1      24.8        130          .97           .75
Criterion  117.8      20.1        200          .90

5. Assume that selection has occurred on the basis of test A.


(a) Estimate the validity that test A would have in an unselected group for which
the standard deviation of test A was 6.7.
(b) Estimate the standard deviation of the criterion for this unselected group.


6. Assume that only persons with criterion scores over 90.0 were included in this 
sample, and that otherwise there has been no selection of cases. It is also known that 
for an unselected group the standard deviation of test A is 6.7. 


(a) Estimate the validity of test A for the unselected group. 
(b) Estimate the standard deviation of the criterion for the unselected group. 


7. A group of high-scoring persons on test B are selected, and it is found that the 
standard deviation of test B for this limited group is 3.2. 


(a) What validity would test B probably have for this restricted group? 
(b) What would be the variance of the criterion for this restricted group? 


8. Compare tests C and D on the assumptions that the data are from the same
subjects, that no selection occurred on test C, whereas subjects were screened on the
basis of test D scores from an unselected group with a standard deviation of 12.0
on test D.


9. What is the standard deviation of the difference between actual criterion
scores and criterion scores predicted from scores on test E?


10. If we screen a group using test E scores and obtain a selected subgroup with
standard deviation 15.0 on test E, what will (a) the validity and (b) the error of
estimate be for test E in this subgroup?

12 


Correction for Univariate Selection 
in the Three-Variable Case 


1. Introduction 


Let us consider the three-variable case. If there is explicit selection 
on test X, what effect will this have upon the correlation between two 
variables (Y and Z) subject to incidental selection because they each 
correlate with X? It may be noted that insofar as we are interested in 
the correlation between X and Y, or between X and Z, the equations 
for the bivariate case given in Chapter 11 will apply directly. Also, 
if we are given the variance of X and wish to obtain that of Y or Z and, 
conversely, if we are given the variance of Y or Z and wish to obtain
that of X, the bivariate equations given in Chapter 11 may also be 
used. There is only one problem regarding variances that is unique to 
the three-variable situation: given the variance of Y, solve for that of Z, 
or vice versa. That is, we have the problem of how to express the vari- 
ance of one incidental selection variable in terms of the other incidental 
selection variable. The other problem that occurs in the three-variable 
but not in the two-variable case is determining the correlation between 
two incidental selection variables (the correlation Ryz). 


2. Practical importance of the three-variable case in
univariate selection

Let us consider a practical illustration of three variables with uni- 
variate selection. Suppose we are trying out a new test for the selection 
of college students. Let us call this new test, test Y. The students 
available for a validation study have already been admitted to college 
on the basis of selection on test X. Test Y is then administered to the 
freshman class, and the new test is correlated with college grades. 
College grades is the criterion score which for present purposes we may 
designate as variable Z. Since tests X and Y do not correlate perfectly, 
it is evident that all the freshman class will have passed test X (since 
it was used for admission), but some of the freshman class will fail 


test Y. That is, the range of scores on test Y will be greater than the range
of scores on test X, since X is subject to explicit selection and Y only
to incidental selection. Equation 21 of Chapter 11 and Figure 4 of 
that chapter show that the variance of the explicit selection variable is 
reduced more than the variance of the incidental selection variable. 
We are interested in comparing the validity of test X and test Y under
similar sampling procedures. If an unselected group had been admitted 
to college, would the validity of test X have been higher or lower than 
the validity of test Y? 

To illustrate the importance of this problem, let us consider a hypo- 
thetical case in which two admissions tests (X and Y) have the same 
validity, the same correlation with grades (Z). That is, both tests 
have the same validity for an unrestricted group. Then for the restricted 
group, the validity of the test actually used for selection (the explicit 
selection variable) would always be lower than the validity of the test
not used for selection (the incidental selection variable). In other words,
if we considered only the zero-order uncorrected correlation coefficients, 
we should always reach the conclusion that the test not being used for 
selection was better than the one being used. When X was used 
for selection, Y would have the higher validity (uncorrected) ; and, when 
Y was used for selection, X would have the higher validity. 

It follows that, whenever a test is already being used for selection, 
it is relatively easy to try out another test on the selected group and find 
that it is better than the test already in use. Evidence that a selection 
program already in use is not as good as a new one proposed is not con- 
vincing if we use only the uncorrected validity coefficients. It is neces- 
sary to use validity coefficients that have been adjusted for the effects 


of selection. Some of the appropriate equations were presented in 
Chapter 11; the others will be given here. 


3. Basic assumptions for univariate selection in the three-
variable case


Let us designate these three variables as X, Y, and Z and find out
how selection on the basis of variable X will affect the correlation
between and the variance of Y and Z. Since selection was on the basis
of test X, the regression of Y on X and also the regression of Z on X
will not be altered. In order to state this, we shall designate one group
by capital letters, the other group by lower-case letters, and write

(1)    $r_{xy}\dfrac{s_y}{s_x} = R_{XY}\dfrac{S_Y}{S_X}$,

(2)    $r_{xz}\dfrac{s_z}{s_x} = R_{XZ}\dfrac{S_Z}{S_X}$.

Also, as before, it is reasonable to assume that the variance of Y about
the regression of Y on X will be about the same for both groups.
Similarly, the variance of Z about the regression of Z on X will be about the
same for both groups. Again using capital letters for one group and
small letters for the other, we can write

(3)    $s_y^2\left(1 - r_{xy}^2\right) = S_Y^2\left(1 - R_{XY}^2\right)$,

(4)    $s_z^2\left(1 - r_{xz}^2\right) = S_Z^2\left(1 - R_{XZ}^2\right)$.

In addition to the foregoing assumptions, which are identical with 
those that were used in the bivariate case, it is also necessary to make 
one other assumption for the three-variable case. This assumption is 
that the correlation between Y and Z for a constant X is the same for 
both groups. It can be seen that holding X constant by the statistical 
device of partial correlation should give about the same results, regard- 
less of whether or not there is selection on X. Holding X constant is 
the most extreme form of selection possible so that the resulting partial 
correlation between Y and Z will be about the same both for the entire 
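group and the curtailed group. Using the conventional notation, we
may write

(5)    $r_{yz \cdot x} = \dfrac{r_{yz} - r_{xy} r_{xz}}{\sqrt{1 - r_{xy}^2}\sqrt{1 - r_{xz}^2}} = R_{YZ \cdot X} = \dfrac{R_{YZ} - R_{XY} R_{XZ}}{\sqrt{1 - R_{XY}^2}\sqrt{1 - R_{XZ}^2}}$.
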


All the assumptions necessary for the three-variable problem in 
univariate selection are given in equations 1 to 5. It should be noted 
that the equations are perfectly symmetrical so that any solution 
obtained for estimating a correlation in a group with a larger variance 
will apply equally well for estimating the correlation in a group with a 
smaller variance. Therefore, instead of saying that the lower-case 
letters stand for the restricted group and the upper-case letters for the 
unrestricted group, we shall adopt the convention that the lower-case 
letters designate values for the group on which complete information is 
available. For one of the groups involved it is necessary to have the 
complete information, consisting of three correlations ($r_{xy}$, $r_{xz}$, and $r_{yz}$)
and three standard deviations ($s_x$, $s_y$, and $s_z$). We shall use the lower-
case letters to designate this group, regardless of whether it is the re- 
stricted or the unrestricted group. 

The upper-case letters will be used to designate the group for which 
only one standard deviation is known. The equations developed will 
hold regardless of whether this group is the restricted or the unrestricted
group. We shall consider first the case where only the standard
deviation of the explicit selection variable is known for the second group
(only $S_X$ is known). We shall consider second the case where only the
standard deviation of one of the incidental selection variables is known
for the second group (only $S_Y$ is known).


4. Variance of both groups known for the explicit selection
variable


Let us proceed as before to solve equation 1 for $R_{XY}$, obtaining

(6)    $R_{XY} = r_{xy}\dfrac{s_y S_X}{S_Y s_x}$.

Next this value for $R_{XY}$ is substituted in equation 3, giving

(7)    $s_y^2\left(1 - r_{xy}^2\right) = S_Y^2\left(1 - r_{xy}^2\dfrac{s_y^2 S_X^2}{S_Y^2 s_x^2}\right)$.

Multiplying the right side through by $S_Y^2$ in order to remove
parentheses gives

(8)    $s_y^2\left(1 - r_{xy}^2\right) = S_Y^2 - r_{xy}^2\dfrac{s_y^2 S_X^2}{s_x^2}$.

This equation can readily be solved for $S_Y^2$:

(9)    $S_Y^2 = s_y^2\left[1 - r_{xy}^2 + r_{xy}^2\dfrac{S_X^2}{s_x^2}\right]$.

This value for $S_Y$ is then substituted in equation 6, $H_X$ being substituted
for the ratio $S_X/s_x$, giving the following value for $R_{XY}$:

(10)    $R_{XY} = \dfrac{r_{xy} H_X}{\sqrt{1 - r_{xy}^2 + r_{xy}^2 H_X^2}}$.


It will be noted by way of check that these values of Sy and Rxy are 
the same as the results previously obtained in Chapter 11 for Sy and 
Rxy on the basis of the assumption of selection on X. Similarly, we 
may write by analogy the results for Sz and Ryz in the present case. 


The solutions already obtained for $S_Y$ and $R_{XY}$ will give $S_Z$ and $R_{XZ}$
if Z is substituted for Y as follows:¹

(11)    $S_Z^2 = s_z^2\left(1 - r_{xz}^2 + r_{xz}^2 H_X^2\right)$,

(12)    $R_{XZ} = \dfrac{r_{xz} H_X}{\sqrt{1 - r_{xz}^2 + r_{xz}^2 H_X^2}}$.

¹ Students who do not yet feel at home with this device of substituting subscripts
should solve independently for $S_Z$ and $R_{XZ}$ by following the general plan of equations
6 to 10. It will be seen that equations 11 and 12 must be the result of this procedure.

The only remaining task is to solve for $R_{YZ}$. It involves solving
equation 5 for $R_{YZ}$ and substituting the known values of $R_{XY}$ and $R_{XZ}$.
A rather simple algebraic routine for doing this is to note from equations
3 and 4 that we may write

(13)    $\sqrt{1 - R_{XY}^2} = \dfrac{s_y}{S_Y}\sqrt{1 - r_{xy}^2}$

and

(14)    $\sqrt{1 - R_{XZ}^2} = \dfrac{s_z}{S_Z}\sqrt{1 - r_{xz}^2}$.

These values may then be substituted for the denominator of the
right-hand term in equation 5, giving

(15)    $\dfrac{r_{yz} - r_{xy} r_{xz}}{\sqrt{1 - r_{xy}^2}\sqrt{1 - r_{xz}^2}} = \dfrac{\left(R_{YZ} - R_{XY} R_{XZ}\right) S_Y S_Z}{\sqrt{1 - r_{xy}^2}\sqrt{1 - r_{xz}^2}\; s_y s_z}$.

Solving for $R_{YZ}$, we have

(16)    $R_{YZ} = \dfrac{\left(r_{yz} - r_{xy} r_{xz}\right) s_y s_z}{S_Y S_Z} + R_{XY} R_{XZ}$.

From equations 1 and 2 we can write

(17)    $R_{XY} R_{XZ} = r_{xy} r_{xz}\dfrac{s_y s_z S_X^2}{S_Y S_Z s_x^2}$.

Substituting equation 17 in equation 16 and factoring out $s_y s_z / S_Y S_Z$,
we have

(18)    $R_{YZ} = \dfrac{s_y s_z}{S_Y S_Z}\left[r_{yz} - r_{xy} r_{xz} + r_{xy} r_{xz}\dfrac{S_X^2}{s_x^2}\right]$,

which expresses $R_{YZ}$ in terms of given quantities and of the two values
($S_Y$ and $S_Z$) for which solutions have already been obtained.
Substituting equations 9 and 11 in equation 18 and simplifying, we have

(19)    $R_{YZ} = \dfrac{r_{yz} - r_{xy} r_{xz} + r_{xy} r_{xz}\dfrac{S_X^2}{s_x^2}}{\sqrt{1 - r_{xy}^2 + r_{xy}^2\dfrac{S_X^2}{s_x^2}}\;\sqrt{1 - r_{xz}^2 + r_{xz}^2\dfrac{S_X^2}{s_x^2}}}$,

where $R_{YZ}$ is the correlation between two incidental selection variables
for one group, and
$S_X$ is the standard deviation of the explicit selection variable in
the same group.

For the other group involved, complete information is available as
follows:
$r_{yz}$ is the correlation between two incidental selection variables,
$r_{xy}$ and $r_{xz}$ are the correlations of the explicit selection variable with
each of the incidental selection variables,
$s_x$ is the standard deviation of the explicit selection variable,
and
$s_y$ and $s_z$ are the standard deviations of the two incidental selection
variables.


If there is explicit selection on variable x, the variance of x,
and three correlations ($r_{xy}$, $r_{xz}$, and $r_{yz}$) are available for one
group, and only the variance of X ($S_X^2$) is available for a
second group, then the correlation of the two incidental
selection variables ($R_{YZ}$) for the second group may be estimated
by equation 19.
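
Equations 10, 12, and 19 can be combined in a short sketch. The Python code below is illustrative only: the function names are hypothetical, and the numbers are an invented case in which tests X and Y have the same validity (.50) in the unrestricted group. Here the lower-case values describe the unrestricted group, and $H_X = S_X/s_x$ is less than 1 because the second group is the restricted one.

```python
import math


def corrected_r_with_x(r, H_X):
    """Equations 10 and 12: correlation of X with an incidental selection variable."""
    return r * H_X / math.sqrt(1.0 - r ** 2 + r ** 2 * H_X ** 2)


def corrected_r_yz(r_yz, r_xy, r_xz, H_X):
    """Equation 19: correlation between the two incidental selection variables."""
    numerator = r_yz - r_xy * r_xz + r_xy * r_xz * H_X ** 2
    denominator = math.sqrt(1.0 - r_xy ** 2 + r_xy ** 2 * H_X ** 2) * math.sqrt(
        1.0 - r_xz ** 2 + r_xz ** 2 * H_X ** 2
    )
    return numerator / denominator


# Hypothetical unrestricted group: both tests correlate .50 with grades (Z),
# and the two tests correlate .60 with each other.
r_xy, r_xz, r_yz = 0.60, 0.50, 0.50
H_X = 0.5  # selection on X halves its standard deviation

validity_x_restricted = corrected_r_with_x(r_xz, H_X)
validity_y_restricted = corrected_r_yz(r_yz, r_xy, r_xz, H_X)
print(round(validity_x_restricted, 3), round(validity_y_restricted, 3))
```

In this invented case the test used for selection shows the lower uncorrected validity in the restricted group, in line with the argument of section 2.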


It should be noted that where the variance of the explicit selection 
variable is known for both groups, it is necessary to present only the 
equation for the correlation between the two incidental selection vari- 
ables. All other values (the variance of the two incidental selection 
variables and their correlation with the explicit selection variable) are 
given by the bivariate equations 18 and 20, presented in Chapter 11. 


Equation 19, or slight variants of it, has been given by Pearson 
(1903a) and Thorndike (1947). 


5. Variance of both groups known for one of the incidental 
selection variables 

We now turn to the final case for univariate selection with three
variables. Again we designate the explicit selection variable as x and
assume that the standard deviation for both groups is known on one
of the other variables.

Since both Y and Z are variables that have been subject only to
incidental selection, it makes no difference whether $S_Y$ is assumed known
and $S_Z$ unknown, or vice versa. We shall assume $S_Y$ known, and solve
for $S_Z$. The equations derived can be applied generally by designating
as Y the variable subject to incidental selection for which both variances
are known. The incidental selection variable for which only one
variance is known will be designated as variable Z.

Without loss of generality then, we may say that selection is on
variable X and that the standard deviations of both groups are available
on variable Y. For this case the known values are $r_{xy}$, $r_{xz}$, $r_{yz}$, $s_x$, $s_y$,
$s_z$, and $S_Y$. Knowing these values enables us immediately to solve for
$R_{XY}^2$ by equation 3, giving

(20)    $R_{XY}^2 = 1 - \dfrac{s_y^2}{S_Y^2}\left(1 - r_{xy}^2\right)$,


which is the same as equation 8 of Chapter 11. Rearranging the terms
in equation 1, we find that

(21)    $\dfrac{S_X}{s_x} = \dfrac{R_{XY} S_Y}{r_{xy} s_y}$.

Substituting the value for $R_{XY}$ found in equation 20 and solving for
$S_X$, we have

(22)    $S_X = \dfrac{s_x\sqrt{S_Y^2 - s_y^2\left(1 - r_{xy}^2\right)}}{r_{xy} s_y}$.

Equation 22 is essentially the same as equation 12 of Chapter 11.
Next let us solve for Sz. One way of doing this is to write equation 2 
explicitly for yz: 


(23) Rxz = Ux: 


This value for xz is then substituted in equation 4, giving 


2(1 — re?) = Sz |: 2 s 
z (L — 722°) = 8z“ — r =] 
(24) 82 z EE 
Removing the brackets on the right-hand side of equation 24, we obtain 
2 2 2 2 s: Sx? 
(25) s (1 — Tez") = Sz” — Tez" ^ 
Solving this equation for Sz), we have 
8 2 2 9 Sx" 
(26) Sz? = 8," | 1 — tee” T a | 
Sz“ 


Substituting equation 22 in equation 26 and simplifying gives

(27)  $S_Z^2 = s_z^2 \left[ 1 - r_{xz}^2 + r_{xz}^2 \frac{S_Y^2 - s_y^2 (1 - r_{xy}^2)}{s_y^2 r_{xy}^2} \right].$

Expanding and simplifying, we have

(28)  $S_Z^2 = s_z^2 \left[ \frac{s_y^2 r_{xy}^2 - s_y^2 r_{xz}^2 + S_Y^2 r_{xz}^2}{s_y^2 r_{xy}^2} \right],$

where $S_Z^2$ is the unknown variance of Z, expressed in terms of the
known quantities,
$S_Y^2$ is the variance of one of the incidental selection variables
for the second group,
$s_y^2$ is the variance of y for the first group,
$s_z^2$ is the variance of z for the first group, and
$r_{xy}$ and $r_{xz}$ are the correlations of the explicit selection
variable with each of the other variables for the first group.


If selection is explicitly based on variable x, complete infor- 
mation is available for one group, and the variance of one of 
the incidental selection variables (y) is known for another 
group, then equation 28 may be used to estimate the vari- 
ance of the other incidental selection variable (Z) for the sec- 
ond group. 
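
The chain of equations 20, 22, and 28 can be sketched numerically as follows. This is only an illustration, not part of the original derivation: the function name `corrected_sx_and_sz` and its argument names are hypothetical, lower-case arguments refer to the group with complete information, `S_Y` is the known standard deviation of y in the second group, and a positive `r_xy` is assumed.

```python
import math


def corrected_sx_and_sz(r_xy, r_xz, s_x, s_y, s_z, S_Y):
    """Return (R_XY squared, S_X, S_Z) for the second group.

    Hypothetical helper. Lower-case arguments describe the group with
    complete information; S_Y is the second-group standard deviation
    of y. Assumes r_xy > 0.
    """
    # Equation 20: squared correlation of X and Y in the second group.
    R_XY_sq = 1.0 - (s_y ** 2 / S_Y ** 2) * (1.0 - r_xy ** 2)
    # Equation 22: standard deviation of the explicit selection variable.
    S_X = s_x * math.sqrt(S_Y ** 2 - s_y ** 2 * (1.0 - r_xy ** 2)) / (s_y * r_xy)
    # Equation 28: variance of the other incidental selection variable.
    numerator = s_y ** 2 * r_xy ** 2 - s_y ** 2 * r_xz ** 2 + S_Y ** 2 * r_xz ** 2
    S_Z_sq = s_z ** 2 * numerator / (s_y ** 2 * r_xy ** 2)
    return R_XY_sq, S_X, math.sqrt(S_Z_sq)
```

Only standard deviations and correlations enter the computation; the group means play no part. If the inputs are mutually inconsistent the quantity under the square root can become negative, in which case `math.sqrt` raises an error.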


In order to obtain a value for $R_{XZ}$, let us use equation 4 and solve
for $R_{XZ}^2$, obtaining

(29)  $R_{XZ}^2 = 1 - \frac{s_z^2 (1 - r_{xz}^2)}{S_Z^2}.$

Substituting the value of $S_Z^2$ from equation 28, we obtain

(30)  $R_{XZ}^2 = 1 - \frac{s_y^2 r_{xy}^2 (1 - r_{xz}^2)}{S_Y^2 r_{xz}^2 - s_y^2 r_{xz}^2 + s_y^2 r_{xy}^2}.$


This solution expresses $R_{XZ}$ entirely in terms of known quantities. We
may put this in another form by multiplying and simplifying, obtaining

(31)  $R_{XZ} = r_{xz} \sqrt{\frac{S_Y^2 - s_y^2 + s_y^2 r_{xy}^2}{S_Y^2 r_{xz}^2 - s_y^2 r_{xz}^2 + s_y^2 r_{xy}^2}}.$

This form, it will be noted, could also have been readily obtained by
substituting equations 22 and 28 in equation 23.

In equation 31 $R_{XZ}$ is the unknown, the correlation between the
explicit selection variable and the other incidental selection variable,
and all other terms have the same definition as in equation 28.
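
A minimal numerical sketch of equation 31 follows. It is an illustration only; the function name `corrected_r_xz` and the argument names are hypothetical. Note that, as the equation is written, only $r_{xy}$, $r_{xz}$, $s_y$, and $S_Y$ are needed.

```python
import math


def corrected_r_xz(r_xy, r_xz, s_y, S_Y):
    """Equation 31: second-group correlation of X with Z (hypothetical helper)."""
    gain = S_Y ** 2 - s_y ** 2                      # change in the variance of y
    numerator = gain + s_y ** 2 * r_xy ** 2
    denominator = r_xz ** 2 * gain + s_y ** 2 * r_xy ** 2
    return r_xz * math.sqrt(numerator / denominator)
```

When `S_Y` equals `s_y` the ratio under the square root is 1 and the function returns `r_xz` unchanged, as one would expect when the two groups do not differ in variability on y.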


If selection is explicitly based on variable x, complete infor-
mation is available for one group, and the variance of one
of the incidental selection variables (y) is known for another
group, then equation 31 should be used to estimate the cor-
relation between the explicit selection variable and the other
incidental selection variable for the second group.

Let us now turn to the problem of solving for $R_{YZ}$. First let us note
that from equations 3 and 4 we may write

(32)  $\sqrt{(1 - R_{XY}^2)(1 - R_{XZ}^2)} = \frac{\sqrt{(1 - r_{xy}^2)(1 - r_{xz}^2)}\, s_y s_z}{S_Y S_Z}.$

Substituting this value in equation 5 gives

(33)  $\frac{r_{yz} - r_{xy} r_{xz}}{\sqrt{(1 - r_{xy}^2)(1 - r_{xz}^2)}} = \frac{(R_{YZ} - R_{XY} R_{XZ})\, S_Y S_Z}{\sqrt{(1 - r_{xy}^2)(1 - r_{xz}^2)}\, s_y s_z}.$

Solving this equation for $R_{YZ}$, we have

(34)  $R_{YZ} = \frac{s_y s_z}{S_Y S_Z} (r_{yz} - r_{xy} r_{xz}) + R_{XY} R_{XZ}.$

From equations 1 and 2 we obtain the following expression for $R_{XY} R_{XZ}$:

(35)  $R_{XY} R_{XZ} = r_{xy} r_{xz} \frac{s_y s_z}{S_Y S_Z} \frac{S_X^2}{s_x^2}.$

If this value for the product $R_{XY} R_{XZ}$ is substituted in equation 34 and
the common terms ($s_y s_z / S_Y S_Z$) factored out, we have

(36)  $R_{YZ} = \frac{s_y s_z}{S_Y S_Z} \left[ r_{yz} - r_{xy} r_{xz} + r_{xy} r_{xz} \frac{S_X^2}{s_x^2} \right].$

This equation expresses $R_{YZ}$ in terms of known quantities, and the values
$S_X$ and $S_Z$ for which equations have already been given. If we substitute
equations 22 and 28 in equation 36 and simplify, we get

(37)  $R_{YZ} = \frac{r_{xz} (S_Y^2 - s_y^2) + r_{xy} r_{yz} s_y^2}{S_Y \sqrt{r_{xz}^2 (S_Y^2 - s_y^2) + s_y^2 r_{xy}^2}},$

where $r_{yz}$ is the correlation between the two incidental selection variables
for the first group,
$R_{YZ}$ is the correlation between the two incidental selection variables
for the second group, and the other terms have the same
definitions as for equation 28.
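
Equation 37 can be evaluated directly from the first-group correlations and the two standard deviations of y. The sketch below is illustrative only; `corrected_r_yz` and its argument names are hypothetical, and a positive `r_xy` is assumed, as in the simplification that leads from equation 36 to equation 37.

```python
import math


def corrected_r_yz(r_xy, r_xz, r_yz, s_y, S_Y):
    """Equation 37: second-group correlation of Y with Z (hypothetical helper)."""
    gain = S_Y ** 2 - s_y ** 2                      # change in the variance of y
    numerator = r_xz * gain + r_xy * r_yz * s_y ** 2
    denominator = S_Y * math.sqrt(r_xz ** 2 * gain + s_y ** 2 * r_xy ** 2)
    return numerator / denominator
```

As a check on the algebra, setting `S_Y` equal to `s_y` makes the function return `r_yz` itself.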


If selection is explicitly based on variable x, complete infor-
mation is available for one group, and the variance of one of
the incidental selection variables (y) is known for another
group, then equation 37 should be used to estimate the cor-
relation between the two incidental selection variables for the
second group.

This completes the consideration of the second case considered under
the three-variable problem, the case in which selection was on one
variable (X), and the standard deviation of the variable that was subject
to incidental selection was known for both groups. This variable was
designated Y, and equations were derived to express $S_X$, $S_Z$, $R_{XZ}$,
$R_{YZ}$, and $R_{XY}$ in terms of the known quantities $r_{xz}$, $r_{yz}$,
$r_{xy}$, $s_x$, $s_y$, $s_z$, and $S_Y$. The equations for $S_X$ and $R_{XY}$
were similar to those derived for the corresponding bivariate case in
Chapter 11. However, the equations (involving variable Z) for $S_Z$,
$R_{YZ}$, and $R_{XZ}$ were different from those previously derived for the
bivariate case.

One general caution must be noted in the application of the selection 
equations presented in Chapters 11 and 12. They are applicable only 
when the selection is made on the basis of one variable, and is made in 
such a way as not to alter the regression line of other variables on the 
explicit selection variable or the error made in estimating other variables 
from the explicit selection variable. If these assumptions hold, then 
the equations are perfect. However, as noted at the beginning of 
Chapter 11, in most practical situations the equations are likely not 
to apply. 

We should inspect the frequency distribution of the uncurtailed dis- 
tribution and of the curtailed distribution for the variables involved 
before deciding upon the equations to use. If there is a sharp cut-off 
point, and every person above was accepted while every one below was 
rejected, it is clear that the variable in question may be regarded as the 
explicit selection variable, and the equations of Chapters 11 and 12 
used with confidence. If there is not a sharp cut-off point or a reasonably 
sharp cut-off point on one variable, the exact selection procedure must 
be more carefully investigated to determine if the type of selection used 
can reasonably be assumed not to have altered certain of the errors of 
estimate and regression slopes involved. If we can justify the assump- 
tions indicated in equations 1 to 5, the selection equations apply. If, 
after study, we feel that the assumptions of equations 1 to 5 do not 
apply, there is no way of estimating the probable effects of selec- 
tion. 

As mentioned in Chapter 11, there is no adequate substitute for a 
well-designed experiment that makes the proper comparisons without 
any selection procedure. If we must work in a practical situation where 
selection is essential, every effort must be made to have the selection 
proceed in a specified manner, using definite critical score points, and 
rejecting every one below and accepting every one above these points. 
If such procedures are instituted and actually followed, the selection 
formulas can be used. If the selection procedure varies in accordance
with “numerous practical considerations,” it is impossible to estimate
the effect of selection.


6. Summary

These are the basic assumptions for the three-variable case.

1. The slopes of the regressions of the incidental on the explicit
selection variable are not altered by selection. This assumption is
given in equations

(1)  $r_{xy} \left( \frac{s_y}{s_x} \right) = R_{XY} \left( \frac{S_Y}{S_X} \right)$

and

(2)  $r_{xz} \left( \frac{s_z}{s_x} \right) = R_{XZ} \left( \frac{S_Z}{S_X} \right).$

2. The error made in estimating either of the incidental selection
variables from the explicit selection variable is not altered by selection.
This assumption is given in equations

(3)  $s_y^2 (1 - r_{xy}^2) = S_Y^2 (1 - R_{XY}^2)$

and

(4)  $s_z^2 (1 - r_{xz}^2) = S_Z^2 (1 - R_{XZ}^2).$

3. The partial correlation between the two incidental selection vari-
ables is not altered by selection. This assumption is given in equation

(5)  $\frac{r_{yz} - r_{xy} r_{xz}}{\sqrt{(1 - r_{xy}^2)(1 - r_{xz}^2)}} = \frac{R_{YZ} - R_{XY} R_{XZ}}{\sqrt{(1 - R_{XY}^2)(1 - R_{XZ}^2)}}.$


It was assumed that we were always given complete information on
one of the groups, and the convention that this group was represented
by lower-case letters was adopted. That is, it was assumed that $r_{xy}$,
$r_{xz}$, $r_{yz}$, $s_x$, $s_y$, and $s_z$ were always known. Unless complete
information is available on one group, it is not possible to solve the
problem. Two cases were then considered.

1. The case was considered in which the standard deviation of the
explicit selection variable ($S_X$) is known for the second group. The
value of $S_Y$ was given in equation 9, and the value of $S_Z$ in equation 11.
These equations are not repeated here because they are formally iden-
tical with equation 20 of Chapter 11 (on bivariate selection). Similarly,
the value of $R_{XY}$ was given in equation 10, and of $R_{XZ}$ in equation 12.
These equations are formally identical with equation 18, Chapter 11.
The only new problem posed by the three-variable problem for this first
case is estimating the correlation between the two incidental selection
variables for the second group. This value is given by equation 19.


2. The case was also considered in which the standard deviation of
one of the incidental selection variables was known for both groups.
Without loss of generality, we may designate the known standard devia-
tion $S_Y$ and use $S_Z$ for the unknown standard deviation of the other
selection variable. The formula for $S_X$ is given in equation 22, which
is not repeated since it is identical with equation 12, Chapter 11. The
variance of the other incidental selection variable ($S_Z^2$) is given by
equation

(28)  $S_Z^2 = s_z^2 \left[ \frac{s_y^2 r_{xy}^2 - s_y^2 r_{xz}^2 + S_Y^2 r_{xz}^2}{s_y^2 r_{xy}^2} \right].$

The value of $R_{XY}$ is given in equation 20, which is identical with equa-
tion 8, Chapter 11, and is not repeated here. The correlation $R_{XZ}$ is
given by equation

(31)  $R_{XZ} = r_{xz} \sqrt{\frac{S_Y^2 - s_y^2 + s_y^2 r_{xy}^2}{S_Y^2 r_{xz}^2 - s_y^2 r_{xz}^2 + s_y^2 r_{xy}^2}}.$

The correlation between the two incidental selection variables is given
by equation

(37)  $R_{YZ} = \frac{r_{xz} (S_Y^2 - s_y^2) + r_{xy} r_{yz} s_y^2}{S_Y \sqrt{r_{xz}^2 (S_Y^2 - s_y^2) + s_y^2 r_{xy}^2}}.$

These equations (31 and 37) have no counterpart in the bivariate case.


Problems 


1. Assume that we are dealing with a case in which explicit selection occurs on the
criterion, and the test is subject only to incidental selection (such as would occur if
everyone taking the entrance test were admitted, but only those who received “pass-
ing” scores on the criterion were included in the selected group).

(a) Write the equation showing the relationship between the entrance test vari-
ances and validities for the curtailed and the complete group.

(b) Write the equation showing the relationship between the entrance test vari-
ances and reliabilities for both groups.

2. Compare, from information in previous chapters and problems, the relationship
between reliability and validity:

(a) When the change is due to alteration of test length.
(b) When the change is due to explicit curtailment of test variability.
(c) When the change is due to explicit selection on the criterion resulting in inci-
dental selection on the test in question.

3. Test X is given to a group of applicants for admission to a certain college. The
mean score for all applicants is 150, and the standard deviation is 30. After the
students have been selected on the basis of this test and admitted to college, test Y
is given. The two tests, X and Y, are correlated with grade average, which is used
as a criterion, with the following results:

Test X: mean 170; standard deviation 20; validity .63.
Test Y: mean 160; standard deviation 25; validity .68.
The correlation of X and Y is .80.

According to these data, which test would be better to use for college admission?
4. Prove that, if $r_{yz} = r_{xz}$ and $S_X < s_x$, $R_{YZ} > R_{XZ}$.
5. Prove that, if $r_{yz} = r_{xz}$ and $S_X > s_x$, $R_{YZ} < R_{XZ}$.


6. Study this reference: G. G. Thompson and S. L. Witryol (1946), “The relation-
ship between intelligence and motor learning ability as measured by a high relief
finger maze,” J. Psychol., 22, pages 237-246.

(a) Comment on the use of the correction for restriction of range.
(b) Present a correction for homogeneity which you consider appropriate for these
data.
(c) Give both the calculations and the argument for the correction you select.


13 


Correction for Multivariate Selection 
in the General Case 


1. Basic definitions and assumptions 


The equations for multivariate selection in the general case become
almost prohibitively complex unless matrix algebra is used. Since only
a few theorems of matrix algebra are used in this derivation, these
theorems will be summarized here. Any set of numbers arranged in
rows and columns is termed a matrix, and is designated by a single
letter, such as M, N, A, B. In the derivations of this chapter four basic
matrices are necessary. We have the matrix of test scores for the
variables subject to explicit selection. For N individuals and A tests
we may define

(1)  $X_{NA} = \begin{bmatrix} X_{11} & \cdots & X_{1A} \\ \vdots & & \vdots \\ X_{N1} & \cdots & X_{NA} \end{bmatrix}.$


The X’s on the right-hand side of this equation are defined as deviation 
scores to simplify the formulas for variances and covariances. In defin- 
ing the score matrix we may let each individual represent a row and each 
test a column, or vice versa. In the score matrices used here we shall 
arbitrarily let each row represent an individual, and each column a test. 


The matrix of test scores for the variables subject to incidental selection
is defined by

(2)  $Y_{NB} = \begin{bmatrix} Y_{11} & \cdots & Y_{1B} \\ \vdots & & \vdots \\ Y_{N1} & \cdots & Y_{NB} \end{bmatrix}$

for N individuals and B tests. Again the Y’s on the right-hand side of
the equation designate deviation scores. The X’s are regarded as
independent variables, and the Y’s as dependent variables, which may
be estimated by a weighted sum of the X’s. Let us use $W_{X_a Y_b}$ to desig-
nate the weight to be applied to $X_a$ to predict $Y_b$. The complete matrix
of weights will be defined by

(3)  $W_{XY} = \begin{bmatrix} W_{X_1 Y_1} & \cdots & W_{X_1 Y_B} \\ \vdots & & \vdots \\ W_{X_A Y_1} & \cdots & W_{X_A Y_B} \end{bmatrix}.$

The first column contains the weights to be applied to the independent
variables $X_1$ to $X_A$ to predict $Y_1$. In general any column (which may
be designated b) gives the weights to apply to the independent variables
$X_1$ to $X_A$ to predict $Y_b$. If the predicted $Y_b$ is indicated by $\hat{Y}_b$, we have

$\hat{Y}_{ib} = W_{X_1 Y_b} X_{i1} + W_{X_2 Y_b} X_{i2} + \cdots + W_{X_A Y_b} X_{iA} \quad (b = 1 \cdots B).$

The weights are to be chosen so that $\sum_{i=1}^{N} (Y_{ib} - \hat{Y}_{ib})^2$ is a minimum.
It is also necessary to introduce a diagonal matrix with the terms along
the principal diagonal, each equal to 1/N, and all other terms equal to
zero. Thus we have the square matrix

(4)  $D_{\theta\theta} = \begin{bmatrix} 1/N & 0 & \cdots & 0 \\ 0 & 1/N & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 1/N \end{bmatrix},$

where the subscript $\theta$ designates the number of rows (columns) in the
matrix and may equal either A or B.

It should be noted that D is a scalar matrix (1/N times the identity
matrix); hence for any two matrices P and Q for which PQ exists,

$DPQ = PDQ = PQD.$


Persons not acquainted with matrix algebra will need to study a
text: the first four chapters of Bôcher's Higher Algebra (1907), for
instance, or the first chapter of Thurstone (1935a) or (1947b). Those
who have not worked recently with matrix algebra may find the following
ten principles helpful in studying the derivations given here.

1. The matrix sum M + N exists only if the number of rows in M
is equal to the number of rows in N, and the number of columns
in M is equal to the number of columns in N.

2. Matrix addition follows the associative and the commutative laws.

$M + (N + O) = (M + N) + O,$
$M + N = N + M.$

3. The matrix product MN exists only if the number of columns in
M is equal to the number of rows in N. Matrix multiplication is
associative but not commutative.

$M(NO) = (MN)O,$
$MN \neq NM$ in general.

4. Matrix multiplication satisfies the distributive law for both pre-
multiplication and postmultiplication.

$M(N + O) = MN + MO,$ and
$(N + O)M = NM + OM.$

5. Any square matrix M with a non-vanishing determinant
($|M| \neq 0$) has an inverse, $M^{-1}$. A matrix premultiplied or post-
multiplied by its inverse gives the identity matrix, I.

$M M^{-1} = M^{-1} M = I.$

6. The inverse of a transpose is equal to the transpose of the inverse.

$(M^{-1})' = (M')^{-1}.$

7. The transpose of a product is the product of transposes taken in
reverse order.

$(MNO)' = O'N'M'.$

8. The transpose of a sum is the sum of the transposes.

$(M + N)' = M' + N'.$

9. The inverse of a product is the product of the inverses taken in
reverse order.

$(MNO)^{-1} = O^{-1} N^{-1} M^{-1}.$

10. The inverse of a sum (M + N) cannot in general be simplified.


The solutions given in this chapter will be stated in terms of matrices
of variances and covariances so that no explicit expressions for the
correlation matrices will be needed. The variance-covariance matrix
for the explicit selection variables, designated as $C_{XX}$, is given by

(5)  $C_{XX} = X'_{AN} X_{NA} D_{AA}.$

In like manner, we may write $C_{YY}$, the variance-covariance matrix for
the variables subject to incidental selection, as

(6)  $C_{YY} = Y'_{BN} Y_{NB} D_{BB}.$

The (XY) covariance matrix is designated by $C_{XY}$ and written

(7)  $C_{XY} = X'_{AN} Y_{NB} D_{BB}.$


The variables subject to incidental selection (Y-variables) are regarded
as being estimated by linear combinations of the explicit selection
variables (X-variables). We designate the matrix of predicted Y-values
by $\hat{Y}_{NB}$ and write

(8)  $\hat{Y}_{NB} = X_{NA} W_{XY}.$

The matrix of errors of prediction is given by the difference between the
actual Y-values and the predicted Y-values. This matrix is designated
E and written

(9)  $E = Y_{NB} - \hat{Y}_{NB}.$

The variance-covariance matrix of the errors of prediction, designated
$C_{EE}$, is given by

(10)  $C_{EE} = E'E D_{BB}.$

Equations 1 to 10, written in upper-case letters, will be used to
designate the group for which complete information is not available on
all variables. Corresponding equations with lower-case letters will be
used to designate the group for which complete information is available.
We have, corresponding to equations 5 to 10, for the group on which
complete information is available,

(11)  $c_{xx} = x'_{an} x_{na} d_{aa},$
(12)  $c_{yy} = y'_{bn} y_{nb} d_{bb},$
(13)  $c_{xy} = x'_{an} y_{nb} d_{bb},$
(14)  $\hat{y}_{nb} = x_{na} w_{xy},$
(15)  $e = y_{nb} - \hat{y}_{nb},$
and
(16)  $c_{ee} = e'e\, d_{bb}.$

It should be noted that the number of cases in the two groups need
not be the same so that, in general, $n \neq N$, $d_{aa} \neq D_{AA}$, and
$d_{bb} \neq D_{BB}$. However, the number of explicit selection variables must
be the same in both groups, and the number of incidental selection
variables must be the same, so that a = A and b = B.

The group designated by upper-case letters is assumed to be similar
to the one designated by lower-case letters in that the gross score weights
applied to the explicit selection variables to predict the incidental selec-
tion variables are the same for both groups. This assumption is the
generalization of the assumption of equal slopes given in equations 1
and 2 of Chapter 12 for the three-variable case in univariate selection.
In matrix notation this assumption is written

(17)  $W_{XY} = w_{xy}.$

In addition, it is assumed that the errors of estimate are the same for
both groups. (See equations 3 and 4, Chapter 12.) It is also assumed,
as in equation 5 of Chapter 12, that the correlations among the Y’s when
the X’s are partialled out are the same as the correlations among the y’s
when the x’s are partialled out. These two assumptions are written
in the matrix equation

(18)  $C_{EE} = c_{ee}.$


The diagonal terms of these matrices are the squares of the errors of
estimate; assuming them equal corresponds to the generalization of
assumptions of equations 3 and 4 of Chapter 12. The non-diagonal
terms of equation 18 are the partial covariances. Assuming them equal
corresponds to the generalization of equation 5, Chapter 12.

Since the basic assumptions given in equations 17 and 18 involve
$W_{XY}$ and $C_{EE}$, we shall turn to the problem of expressing these matrices
in terms of the basic variance-covariance matrices $C_{XX}$, $C_{XY}$, and $C_{YY}$.

The error made in predicting any given set of Y-values, such as $Y_{ib}$
(where $i = 1 \cdots N$, and b is some specified value that may be any one
of the values from 1 to B), is indicated by subtracting the summed
product ($\sum_{g=1}^{A} W_{gb} X_{ig}$) from $Y_{ib}$, squaring these differences,
summing over i for a given value of b, and dividing by N. This gives an
error variance term that is one of the diagonal terms of the matrix
$C_{EE}$. In order to deal solely with these diagonal terms, equations 19 to
21, inclusive, are in the usual algebraic summation notation; equation 22
returns to matrix notation. Let us designate a typical diagonal term by
$E_b^2$ (where $b = 1 \cdots B$). We then have

(19)  $E_b^2 = \frac{1}{N} \sum_{i=1}^{N} \left( Y_{ib} - \sum_{g=1}^{A} W_{gb} X_{ig} \right)^2 \quad (b = 1 \cdots B).$

The multiple correlation problem is to select the weights ($W_{gb}$) so as to
minimize the value of the error variance $E_b^2$. For a given value of
$E_b^2$, we differentiate with respect to $W_{hb}$ ($h = 1 \cdots A$) and set the
derivative equal to zero, obtaining

(20)  $\frac{dE_b^2}{dW_{hb}} = -\frac{2}{N} \sum_{i=1}^{N} \left( Y_{ib} - \sum_{g=1}^{A} W_{gb} X_{ig} \right) X_{ih} = 0.$

Removing parentheses and changing the order of summation in the
second term, we have

(21)  $\frac{dE_b^2}{dW_{hb}} = -\frac{2}{N} \left[ \sum_{i=1}^{N} Y_{ib} X_{ih} - \sum_{g=1}^{A} W_{gb} \sum_{i=1}^{N} X_{ig} X_{ih} \right] = 0.$

For a single value of h, equation 21 states a single condition for mini-
mizing a given term in the diagonal of $C_{EE}$. If we let h take in turn
each of the values from 1 to A, while b remains fixed, equation 21
indicates a set of A equations which specifies the weights necessary to
minimize a given term ($E_b^2$) in the diagonal of $C_{EE}$. If b now takes in
turn each of the values from 1 to B, equation 21 indicates a set of AB
equations which specifies the weights necessary to minimize in turn
each of the terms ($E_b^2$) ($b = 1 \cdots B$) in the diagonal of $C_{EE}$. When
$h = 1 \cdots A$ and $b = 1 \cdots B$, the first term of equation 21 is identical
with the matrix given in equation 7. The last term is in the general
case identical with the product of equations 5 and 3, or $C_{XX} W_{XY}$.
Putting equations 21 into matrix notation, we have

(22)  $C_{XY} - C_{XX} W_{XY} = 0.$


From equation 22 we obtain the solution for the matrix of weights
($W_{XY}$). Transferring $C_{XX} W_{XY}$ to the other side of the equation, and
premultiplying both sides by the inverse of $C_{XX}$, we have

(23)  $C_{XX}^{-1} C_{XY} = C_{XX}^{-1} C_{XX} W_{XY} = W_{XY}.$

A corresponding equation can be derived for the group on which
complete information is available. Substituting lower-case letters in
equation 23 gives this equation,

(24)  $w_{xy} = c_{xx}^{-1} c_{xy}.$

Since the transpose of both equations 23 and 24 will also be needed, we
shall write these explicitly as

(25)  $W'_{YX} = C'_{YX} C_{XX}^{-1}$
and
(26)  $w'_{yx} = c'_{yx} c_{xx}^{-1}.$

Since $C_{XX}$ is a symmetric matrix, its inverse is also symmetric; hence
$(C_{XX}^{-1})' = C_{XX}^{-1}$, and correspondingly, since $c_{xx}$ is symmetric,
$(c_{xx}^{-1})' = c_{xx}^{-1}$.

Equations 23 and 24 or equations 25 and 26 give the best weights to
use for the X’s (or for the x’s) to predict the Y’s (y’s). These equations
are identical with the equations for the best weights used in multiple
correlation. (See Chapter 20, equations 52 and 53.)

We now turn to the evaluation of the variance-covariance matrix of
the errors of prediction ($C_{EE}$) in terms of the matrices $C_{XX}$, $C_{YY}$, and
$C_{XY}$, which can be obtained from the data on observed scores. Substi-
tuting equation 9 in equation 10 gives

(27)  $C_{EE} = (Y_{NB} - \hat{Y}_{NB})' (Y_{NB} - \hat{Y}_{NB}) D_{BB}.$

Removing parentheses and expanding, we have

(28)  $C_{EE} = Y'_{BN} Y_{NB} D_{BB} - Y'_{BN} \hat{Y}_{NB} D_{BB} - \hat{Y}'_{BN} Y_{NB} D_{BB} + \hat{Y}'_{BN} \hat{Y}_{NB} D_{BB}.$

Substituting equations 6 and 8 in equation 28, we have

(29)  $C_{EE} = C_{YY} - Y'_{BN} X_{NA} W_{XY} D_{BB} - W'_{YX} X'_{AN} Y_{NB} D_{BB} + W'_{YX} X'_{AN} X_{NA} W_{XY} D_{BB}.$

Substituting equations 5, 7, 23, and 25 in equation 29, and noting that
D is a scalar matrix, we obtain

(30)  $C_{EE} = C_{YY} - C'_{YX} C_{XX}^{-1} C_{XY} - C'_{YX} C_{XX}^{-1} C_{XY} + C'_{YX} C_{XX}^{-1} C_{XX} C_{XX}^{-1} C_{XY}.$

Since a matrix times its inverse is the identity matrix, the last two terms
cancel each other, leaving

(31)  $C_{EE} = C_{YY} - C'_{YX} C_{XX}^{-1} C_{XY}.$


Equation 31 may be written in another form by substituting equation 23
in it. Alternatively, equation 25 may be substituted in equation 31.
Making these substitutions gives

(32)  $C_{EE} = C_{YY} - C'_{YX} W_{XY} = C_{YY} - W'_{YX} C_{XY}.$

A corresponding set of equations can be derived for the group on which
complete information is available. Substituting lower-case letters in
equations 31 and 32 gives these equations as

(33)  $c_{ee} = c_{yy} - c'_{yx} c_{xx}^{-1} c_{xy}$
and
(34)  $c_{ee} = c_{yy} - c'_{yx} w_{xy} = c_{yy} - w'_{yx} c_{xy}.$
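
The weights of equation 23 and the error variance-covariance matrix of equation 32 can be computed from a small set of deviation scores. The sketch below is illustrative only: it uses plain Python lists, the helper names (`transpose`, `matmul`, `inverse`, `covariance`, `weights_and_error_cov`) are hypothetical, and the four-person data set is invented for the example.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def inverse(m):
    """Gauss-Jordan inverse of a small non-singular square matrix."""
    n = len(m)
    aug = [[float(v) for v in row] + [float(i == j) for j in range(n)]
           for i, row in enumerate(m)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))  # partial pivoting
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def covariance(u, v):
    """U'V / N for deviation-score matrices with one row per individual."""
    n = len(u)
    return [[s / n for s in row] for row in matmul(transpose(u), v)]


def weights_and_error_cov(x, y):
    c_xx, c_xy, c_yy = covariance(x, x), covariance(x, y), covariance(y, y)
    w = matmul(inverse(c_xx), c_xy)              # equation 23
    explained = matmul(transpose(c_xy), w)       # C'_YX W_XY
    c_ee = [[p - q for p, q in zip(rp, rq)]      # equation 32
            for rp, rq in zip(c_yy, explained)]
    return w, c_ee


# Invented deviation scores: four individuals, two X tests, one Y test.
x = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
y = [[3.0], [-1.0], [-2.0], [0.0]]
w, c_ee = weights_and_error_cov(x, y)
```

The same `c_ee` is obtained, up to rounding, by forming the residuals of equation 9 directly and applying equation 10, which is the content of the derivation from equation 27 to equation 31.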


2. Complete information available for all the explicit
selection variables

We now consider the case in which the variance-covariance matrix
of both groups is known for the variables subject to explicit selection.
In this case $C_{XX}$ is given. In general it is assumed that $c_{xx}$, $c_{yy}$, and
$c_{xy}$ are known. The problem of this section, then, is to express the
unknown variance-covariance matrix for the incidental selection vari-
ables $C_{YY}$ and the covariance matrix $C_{XY}$ as functions of the known
terms $C_{XX}$, $c_{xx}$, $c_{yy}$, and $c_{xy}$. Substituting equations 23 and 24 in
equation 17 gives

(35)  $C_{XX}^{-1} C_{XY} = c_{xx}^{-1} c_{xy}.$

Premultiplying both sides by $C_{XX}$ and noting that a matrix times its
inverse is equal to the identity matrix, we have

(36)  $C_{XY} = C_{XX} c_{xx}^{-1} c_{xy}.$

Using equation 24, we may write equation 36 in the alternative form,

(37)  $C_{XY} = C_{XX} w_{xy}.$

Since the transpose of $C_{XY}$ will be needed in solving for $C_{YY}$, we may
write it from equations 36 and 37 as

(38)  $C'_{YX} = c'_{yx} c_{xx}^{-1} C_{XX} = w'_{yx} C_{XX}.$
Equation 36 or 37 gives $C_{XY}$ entirely in terms of known quan-
tities when complete information is available for the explicit
selection variable (X).

This solution for $C_{XY}$ in terms of the explicit selection variables is given
by Pearson (1903a), Aitken (1934), and Burt (1943) and (1944).
To solve for $C_{YY}$, substitute equations 32 and 34 in equation 18,
obtaining

(39)  $C_{YY} - C'_{YX} W_{XY} = c_{yy} - c'_{yx} w_{xy}.$

Using equation 17 and solving explicitly for $C_{YY}$, we obtain

(40)  $C_{YY} = c_{yy} + (C'_{YX} - c'_{yx}) w_{xy}.$

Substituting equations 24 and 38 in equation 40, we have

(41)  $C_{YY} = c_{yy} + c'_{yx} c_{xx}^{-1} C_{XX} c_{xx}^{-1} c_{xy} - c'_{yx} c_{xx}^{-1} c_{xy}.$

Since $C_{YY}$ is symmetric, it is equal to its transpose. Thus, from equation
40, we have an alternative form,

(42)  $C_{YY} = c_{yy} + w'_{yx} (C_{XY} - c_{xy}).$

This equation is also given by Pearson (1903a), Aitken (1934), and
Burt (1943) and (1944).


Equation 41 gives $C_{YY}$ entirely in terms of known quantities.
Equation 40 or 42 gives $C_{YY}$ if $C_{XY}$ is taken from equation
36 or 38.
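
As an illustration of this section, the sketch below applies equation 37 and then equation 42 to small covariance matrices held as plain Python lists. The function `correct_for_explicit_selection` and the helper names are hypothetical; the code assumes that `c_xx` is non-singular.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b, sign=1.0):
    """Elementwise a + sign * b."""
    return [[p + sign * q for p, q in zip(rp, rq)] for rp, rq in zip(a, b)]


def inverse(m):
    """Gauss-Jordan inverse of a small non-singular square matrix."""
    n = len(m)
    aug = [[float(v) for v in row] + [float(i == j) for j in range(n)]
           for i, row in enumerate(m)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))  # partial pivoting
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def correct_for_explicit_selection(c_xx, c_xy, c_yy, C_XX):
    """Return (C_XY, C_YY) for the group in which only C_XX is known."""
    w_xy = matmul(inverse(c_xx), c_xy)                   # equation 24
    C_XY = matmul(C_XX, w_xy)                            # equation 37
    shift = add(C_XY, c_xy, -1.0)                        # C_XY - c_xy
    C_YY = add(c_yy, matmul(transpose(w_xy), shift))     # equation 42
    return C_XY, C_YY
```

For example, with `c_xx = [[4, 0], [0, 1]]`, `c_xy = [[2], [0.5]]`, `c_yy = [[3]]`, and `C_XX = [[9, 1], [1, 4]]`, both regression weights are 0.5, and the error variance $c_{yy} - w'_{yx} c_{xy}$ is the same in the two groups, as assumption 18 requires.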


These equations complete the solution for the case in which complete
information is available for the variables subject to explicit selection.
Equations 36, 37, and 38 are generalizations of equation 18, Chapter 11,
and equations 10 and 12 of Chapter 12. Equations 40, 41, and 42 are
generalizations of equation 20, Chapter 11, and equations 9 and 11 of
Chapter 12.


3. Complete information available for some of the incidental
selection variables

For the generalized treatment of this case, it is necessary to distinguish
between two categories of variables subject to incidental selection. We
will let Y (or y) designate only those incidental selection variables for
which complete information is available. That is, $C_{YY}$ is known in
addition to $c_{yy}$. Incidental selection variables for which incomplete
information is available will be designated by Z (or z). For these only
$c_{zz}$ is known. It thus becomes necessary to express five unknowns,
$C_{XX}$, $C_{XY}$, $C_{ZZ}$, $C_{XZ}$, and $C_{YZ}$ in terms of the known quantities
$C_{YY}$, $c_{yy}$, $c_{xx}$, $c_{zz}$, $c_{xy}$, $c_{xz}$, and $c_{yz}$.

The solution for all terms involving Z will be postponed. Let us
first consider the solution for $C_{XX}$ and $C_{XY}$ in terms of the known
quantities $C_{YY}$, $c_{xx}$, $c_{xy}$, and $c_{yy}$.

Substituting equations 32, 34, and 17 in equation 18, we have

(43)  $C_{YY} - C'_{YX} w_{xy} = c_{yy} - c'_{yx} w_{xy}.$

Transferring the known terms to one side of the equation, we have

(44)  $C'_{YX} w_{xy} = C_{YY} - c_{yy} + c'_{yx} w_{xy}.$

Postmultiplying both sides by $w_{xy}^{-1}$ gives the solution for the transpose
of $C_{XY}$ as

(45)  $C'_{YX} = (C_{YY} - c_{yy}) w_{xy}^{-1} + c'_{yx}.$

Taking the transpose of both sides gives

(46)  $C_{XY} = (w'_{yx})^{-1} (C_{YY} - c_{yy}) + c_{xy}.$

Equation 46 gives $C_{XY}$ in terms of known quantities, if infor-
mation on both groups is available for the incidental selection
variables (Y).

Equation 46 is a generalization of the solution for $R_{XY}$ given in equation
8, Chapter 11, and equation 20, Chapter 12.

It should be noticed that this solution assumes that the inverse of
$w_{xy}$ exists. Since the inverse of a product is the product of the inverses
taken in the reverse order, equation 24 gives

(47)  $w_{xy}^{-1} = c_{xy}^{-1} c_{xx}.$

That is, $w_{xy}$ has an inverse if $c_{xy}$ has an inverse. In other words, the
variables y for which complete information is available must be at least
equal in number to the explicit selection variables, and must not be
linearly dependent on the explicit selection variables (x). If these
two conditions are met, $c_{xy}^{-1}$ will exist, $w_{xy}^{-1}$ will exist, and the solution
given by equation 46 will be meaningful. If the number of incidental
selection variables for which complete information is available is less
than the number of explicit selection variables, or if these incidental
selection variables are linearly dependent on the explicit selection
variables, the information available is not sufficient for an exact solution
of the problem. Given such conditions, various sets of values for $C_{XY}$
would be possible solutions. If the number of incidental selection
variables for which complete information is available is greater than the
number of explicit selection variables, $C_{XY}$ is overdetermined. The
solution for $C_{XY}$ would be in terms of least squares or some other maxi-
mum likelihood procedure. It would also be desirable to devise some
method for assessing variation of the individual solutions from the least
squares solution. If this variation is small, the least squares solution
could be accepted. If this variation is large, it probably indicates that
unknown factors are entering into the selection procedures so that a
further effort must be made to secure a more accurate description of
selection procedures before proceeding with any corrections for selection.
In this book we shall consider only the case in which adequate in-
formation is available, and the solution is exact. That is, the number
of incidental selection variables for which complete information is avail-
able (Y) is equal to the number of explicit selection variables, and these
incidental selection variables are not linearly dependent on the explicit
selection variables. In this case equation 46 is the solution for $C_{XY}$.


The solution for $C_{XX}$ may be obtained by postmultiplying both sides
of equation 37 by $w_{xy}^{-1}$, obtaining

(48)  $C_{XX} = C_{XY} w_{xy}^{-1}.$

Using equations 46 and 24, we may write $C_{XX}$ as

(49)  $C_{XX} = (w'_{yx})^{-1} (C_{YY} - c_{yy}) w_{xy}^{-1} + c_{xx}.$


Equation 49 gives Cxx in terms of known quantities when 
complete information is available for the incidental selection 
variables (Y). 
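
Equations 46 and 49 can be sketched for the exact case, in which the number of Y variables equals the number of X variables. The code below is an illustration with plain Python lists; `recover_explicit_covariances` and the helper names are hypothetical, and both `c_xx` and `c_xy` are assumed to be non-singular.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b, sign=1.0):
    """Elementwise a + sign * b."""
    return [[p + sign * q for p, q in zip(rp, rq)] for rp, rq in zip(a, b)]


def inverse(m):
    """Gauss-Jordan inverse of a small non-singular square matrix."""
    n = len(m)
    aug = [[float(v) for v in row] + [float(i == j) for j in range(n)]
           for i, row in enumerate(m)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))  # partial pivoting
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def recover_explicit_covariances(c_xx, c_xy, c_yy, C_YY):
    """Return (C_XX, C_XY) when C_YY is known for the second group."""
    w_xy = matmul(inverse(c_xx), c_xy)                        # equation 24
    diff = add(C_YY, c_yy, -1.0)                              # C_YY - c_yy
    lead = matmul(inverse(transpose(w_xy)), diff)             # (w'_yx)^-1 (C_YY - c_yy)
    C_XY = add(lead, c_xy)                                    # equation 46
    C_XX = add(matmul(lead, inverse(w_xy)), c_xx)             # equation 49
    return C_XX, C_XY
```

If `c_xy` is singular, which is the situation the text describes as linear dependence, the division by the pivot in `inverse` fails; the sketch makes no attempt at the least squares treatment mentioned above for the overdetermined case.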


It should be noticed that the solution for $C_{XX}$ also involves only the
assumption that the inverse of $w_{xy}$ or of $c_{xy}$ exists. Equation 48 or 49
is a generalization of the solution for $S_X$ given in equation 12, Chapter 11,
and in equation 22, Chapter 12.

Thus we have $C_{XX}$ and $C_{XY}$ in terms of known quantities. The
equations of the preceding section can then be used to give $C_{ZZ}$, $C_{XZ}$,
and $C_{YZ}$. Substituting Z for Y and z for y in equation 37, we have

(50)  $C_{XZ} = C_{XX} w_{xz}.$

Substituting equation 49 in equation 50, we have

(51)  $C_{XZ} = (w'_{yx})^{-1} (C_{YY} - c_{yy}) w_{xy}^{-1} w_{xz} + c_{xx} w_{xz}.$

Using equations 24 and 47, we have

(52)  $C_{XZ} = (w'_{yx})^{-1} (C_{YY} - c_{yy}) c_{xy}^{-1} c_{xz} + c_{xz}.$

Equation 52 gives the solution for $C_{XZ}$ in terms of known
quantities when complete information is available for the in-
cidental selection variables (Y).

Equation 52 is a generalization of equation 31 of Chapter 12.
In a corresponding manner we may write the solution for $C_{ZZ}$ by
substituting Z for Y and z for y in equation 42, obtaining

(53)  $C_{ZZ} = c_{zz} + w'_{zx} (C_{XZ} - c_{xz}).$

Substituting equation 52 in equation 53, we have

(54)  $C_{ZZ} = c_{zz} + w'_{zx} (w'_{yx})^{-1} (C_{YY} - c_{yy}) c_{xy}^{-1} c_{xz}.$

Using equation 26 and the rule that the inverse of a product is the
product of the inverses in reverse order, we may write equation 54 in
an alternative form as

(55)  $C_{ZZ} = c_{zz} + c'_{zx} (c'_{yx})^{-1} (C_{YY} - c_{yy}) c_{xy}^{-1} c_{xz}.$


Equations 54 and 55 give Czz in terms of known quantities 
when complete information is available on some of the inci- 
dental selection variables. 


Equation 55 is the generalization of equation 28, Chapter 12. 

The value of $C_{YZ}$ may be found from the assumption that the partial
correlations between Y and Z (X held constant) are equal to the
partial correlations between y and z (x held constant). This matrix
may be written by substituting Z for the second Y in equation 32, giving

(56) $C_{YZ\cdot X} = C_{YZ} - C'_{XY}W_{XZ} = C_{YZ} - W'_{XY}C_{XZ}$.

In like manner, using the lower-case letters, we have

(57) $c_{yz\cdot x} = c_{yz} - c'_{xy}w_{xz} = c_{yz} - w'_{xy}c_{xz}$.

If we assume that $C_{YZ\cdot X} = c_{yz\cdot x}$, use assumption equation 17, and solve
for $C_{YZ}$ from equations 56 and 57, we have

(58) $C_{YZ} = c_{yz} + w'_{xy}(C_{XZ} - c_{xz})$.

Substituting equation 52 in equation 58, we have

(59) $C_{YZ} = c_{yz} + w'_{xy}w'^{-1}_{xy}(C_{YY} - c_{yy})c_{xy}^{-1}c_{xz}$,

which simplifies to

(60) $C_{YZ} = c_{yz} + (C_{YY} - c_{yy})c_{xy}^{-1}c_{xz}$.

Equation 60 gives $C_{YZ}$ in terms of known quantities.


Equation 60 is a generalization of the expression for $R_{YZ}$ given in
equation 37 of Chapter 12.

This completes the solution for the general case in which complete
information is available for some of the variables subject to incidental
selection. An exact solution is possible only when the number of
incidental selection variables for which complete information is available
is at least equal to the number of explicit selection variables, and when
there is not complete linear dependence between these incidental
selection variables and the explicit selection variables. The solution has
been given for the five matrices $C_{XX}$, $C_{ZZ}$, $C_{XZ}$, $C_{XY}$, and $C_{YZ}$ in terms
of the known quantities $C_{YY}$, $c_{yy}$, $c_{xx}$, $c_{zz}$, $c_{xy}$, $c_{xz}$, and $c_{yz}$.
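
The chain of solutions above (equations 46, 48, 50, 53, and 60) can be checked numerically. The following is only a sketch under stated assumptions: every number in it is invented for illustration, the helper functions are hypothetical and written in plain Python (a matrix library would be the natural choice in practice), and the case is restricted to two explicit selection variables X, two incidental variables Y with complete information, and one incidental variable Z. Both groups are built from the same regression weights and the same error covariances, which is what assumptions 17 and 18 require, and the upper-case matrices are then recovered from $C_{YY}$ and the lower-case matrices alone.

```python
# Sketch only: all matrices are made-up illustrative values.

def matmul(a, b):
    return [[sum(a[i][n] * b[n][j] for n in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def add(a, b):
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

def sub(a, b):
    return [[p - q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

def inv2(a):
    """Inverse of a 2 x 2 matrix (raises ZeroDivisionError if singular)."""
    (p, q), (r, s) = a
    det = p * s - q * r
    return [[s / det, -q / det], [-r / det, p / det]]

def max_abs_diff(a, b):
    return max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))

# Regression weights and error covariances shared by both groups.
w_xy = [[0.6, 0.2], [0.1, 0.7]]
w_xz = [[0.3], [0.4]]
e_yy = [[1.0, 0.2], [0.2, 1.5]]
e_yz = [[0.1], [0.3]]
e_zz = [[2.0]]

def covariances(cxx):
    """Build the covariance matrices implied by cxx and the shared regression."""
    cxy = matmul(cxx, w_xy)
    cxz = matmul(cxx, w_xz)
    cyy = add(matmul(transpose(w_xy), cxy), e_yy)
    cyz = add(matmul(transpose(w_xy), cxz), e_yz)
    czz = add(matmul(transpose(w_xz), cxz), e_zz)
    return cxy, cxz, cyy, cyz, czz

# Lower-case group: everything is known.
c_xx = [[4.0, 1.0], [1.0, 3.0]]
c_xy, c_xz, c_yy, c_yz, c_zz = covariances(c_xx)

# Upper-case group: only C_YY is treated as known; the rest is kept for checking.
C_XX_true = [[2.0, 0.5], [0.5, 1.5]]
C_XY_true, C_XZ_true, C_YY, C_YZ_true, C_ZZ_true = covariances(C_XX_true)

w_xy_inv = matmul(inv2(c_xy), c_xx)                                   # equation 47
C_XY = add(matmul(transpose(w_xy_inv), sub(C_YY, c_yy)), c_xy)        # equation 46
C_XX = matmul(C_XY, w_xy_inv)                                         # equation 48
w_xz_known = matmul(inv2(c_xx), c_xz)                                 # equation 24, z for y
C_XZ = matmul(C_XX, w_xz_known)                                       # equation 50
C_ZZ = add(c_zz, matmul(transpose(w_xz_known), sub(C_XZ, c_xz)))      # equation 53
C_YZ = add(c_yz, matmul(matmul(sub(C_YY, c_yy), inv2(c_xy)), c_xz))   # equation 60

print(max_abs_diff(C_XX, C_XX_true), max_abs_diff(C_YZ, C_YZ_true))
```

The recovered matrices agree with the ones used to construct the example up to rounding error, which is what the derivation predicts when $c_{xy}$ is square and nonsingular.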
It is also possible to solve a more general case in which complete
information is assumed to be available for some of the incidental selection 
variables and some of the explicit selection variables. The detailed 
solution for this case will not be given. It can be solved by using the 
methods of this section to solve for the remaining explicit selection 
variables, and then using the methods in this and the preceding section 
to complete the solution. 


4. Summary 


In dealing with the general case of multivariate selection, X (or x)
was used to designate the variables subject to explicit selection, and
Y (or y) to designate the variables subject to incidental selection.
Lower-case letters are used to designate the group for which complete
information is available. Thus the variance-covariance matrices $c_{xx}$
and $c_{yy}$ are known, as well as the covariance matrix $c_{xy}$. Upper-case
letters are used to designate the group for which only one
variance-covariance matrix is known. That is, either $C_{XX}$ or $C_{YY}$ is known. The
problem is to solve for the unknown variance-covariance matrix and
for the covariance matrix $C_{XY}$.

It is assumed that the properties of the regression of Y on X are
identical with the properties of the regression of y on x. This means
that the gross score weights are equal for both groups, that is,

(17) $W_{XY} = w_{xy}$.

Since these are the least squares weights, equation 17 may be rewritten as

(35) $C_{XX}^{-1}C_{XY} = c_{xx}^{-1}c_{xy}$.

The assumption of identical properties of the two regressions also means
that the error made in estimating Y from X is the same as the error
made in estimating y from x; and that the correlations among the Y's
with the X's partialled out are the same as the correlations among the
y's with the x's partialled out. These assumptions are given in the
equation

(18) $C_{YY\cdot X} = c_{yy\cdot x}$.

Rewriting equation 18 explicitly for the least squares case, we have

(39) $C_{YY} - C'_{XY}W_{XY} = c_{yy} - c'_{xy}w_{xy}$.

For the case in which complete information is available on all the
explicit selection variables, $C_{XX}$ is known. The two unknowns $C_{XY}$
and $C_{YY}$ are given by

(37) $C_{XY} = C_{XX}w_{xy}$,

where

(24) $w_{xy} = c_{xx}^{-1}c_{xy}$,

and by

(42) $C_{YY} = c_{yy} + w'_{xy}(C_{XY} - c_{xy})$.


For the case in which complete information is available on some of
the incidental selection variables, it is necessary to distinguish between
the incidental selection variables for which complete information is
available (designated by Y or y) and the incidental selection variables
that are known for only one group (designated by Z or z). The solution
for the unknowns ($C_{XY}$, $C_{XX}$, $C_{XZ}$, $C_{ZZ}$, and $C_{YZ}$) can be indicated in
terms of the known quantities ($C_{YY}$, $c_{yy}$, $c_{xx}$, $c_{zz}$, $c_{xy}$, $c_{xz}$, and $c_{yz}$). It
should be noted that all these solutions are dependent upon the existence
of the inverse of $w_{xy}$, which is equivalent to the existence of the inverse
of $c_{xy}$, since

(47) $w_{xy}^{-1} = c_{xy}^{-1}c_{xx}$.

That is, we are not considering the cases in which the number of y's
is greater or less than the number of x's; nor are we considering the case
in which the y's are linearly dependent upon the x's. For all the following
solutions, it is assumed that $c_{xy}$ or $w_{xy}$ is a square matrix with its
rank equal to its order.

For this case we have

(46) $C_{XY} = w'^{-1}_{xy}(C_{YY} - c_{yy}) + c_{xy}$,

where $w'^{-1}_{xy}$ is given as the transpose of equation 47. Using equations
46 and 47, we may write

(48) $C_{XX} = C_{XY}w_{xy}^{-1}$,

(50) $C_{XZ} = C_{XX}w_{xz}$,

(53) $C_{ZZ} = c_{zz} + w'_{xz}(C_{XZ} - c_{xz})$,

and

(58) $C_{YZ} = c_{yz} + w'_{xy}(C_{XZ} - c_{xz})$

or

(60) $C_{YZ} = c_{yz} + (C_{YY} - c_{yy})c_{xy}^{-1}c_{xz}$.

For these equations the value of $w_{xz}$ may be found by substituting z
for y in equation 24.
Problems


1. Express the weights ($W_{XY}$) of equation 3 as functions of the correlation matrices
$R_{XY}$ and $R_{XX}$. Show all steps in the derivation and all the necessary assumptions
and definitions.

2. For the case in which complete information is available on the variables (X)
subject to explicit selection, express the correlation matrices $R_{XY}$ and $R_{YY}$ as
functions of the correlation matrices $R_{XX}$, $r_{xx}$, $r_{xy}$, and $r_{yy}$. Show all the steps in the
derivation and all the necessary assumptions and definitions.

3. For the case in which complete information is available on the variables (Y)
subject to incidental selection, express the correlation matrices $R_{XY}$ and $R_{XX}$ as
functions of the correlation matrices $R_{YY}$, $r_{xx}$, $r_{yy}$, and $r_{xy}$. Show all the steps in the
derivation and all the necessary assumptions and definitions.


14 


A Statistical Criterion 
for Parallel Tests 


1. Introduction 

As indicated in Chapters 2, 3, 6, 7, 8, and 9, parallel tests are tests 
that have equal means, equal variances, and equal intercorrelations. 
For any given set of experimental data, where the parallel forms of a 
test are given to a single group, there will be, even under the best condi- 
tions, some small sampling differences. To be certain that the tests 
may be regarded as parallel, it is necessary to have some statistical 
criterion that will show whether or not the means may be regarded as 
samples from a population in which the means are identical, the variances 
may be regarded as samples from a population in which variances are 
identical, and the intercorrelations may be regarded as samples from a 
population in which the correlations are identical. Such a test has 
recently been provided by Dr. S. S. Wilks (Wilks, 1946). Since two 
parallel forms have only one intercorrelation, it is possible in this case 
to check only for equality of means and of variances; hence we must 
consider the case of three or more parallel tests in order to demonstrate 
the statistical criterion for parallel tests. 

We shall not give the derivation of this statistical criterion, which
may be found in the foregoing reference, but shall simply indicate the
proper statistic to compute, and give the table for evaluating the
significance of this statistic in the large sample case.¹

In addition to equal means, variances, and reliabilities, parallel tests 
should have approximately equal validities for any given criterion. 
David Votaw has recently solved this problem as a part of his PhD 
dissertation in mathematical statistics at Princeton University (Votaw, 
1947, 1948). 

It should also be noted that, in addition to satisfying statistical
criteria for being parallel, the tests should contain items dealing with the
same subject matter, items of the same format, etc. In other words, the
tests should be parallel as far as psychological judgment is concerned.
At present, this criterion of psychological judgment is usually the only
one used. The emphasis in this chapter is on the statistical criteria
that the tests must satisfy in addition to the psychological criteria.

¹ This material on tests of compound symmetry is given here with
acknowledgements to and the permission of the editors of The Annals of Mathematical Statistics,


2. Basic statistics needed to compute the statistical criterion 
for parallel tests 


Let us assume that k parallel forms of a test have been given to a
population of N individuals. Assume further that the usual statistics
have been obtained for such a set of data. These statistics are the
mean of each test ($M_g$), the variance of each test ($s_g^2$), and the
covariances for each pair of tests ($c_{gh}$).

It is then necessary to compute the following four quantities:

D, the determinant of the variance-covariance matrix,¹

(1) $s^2 = \frac{\sum_{g=1}^{k} s_g^2}{k}$, the average variance,

(2) $r = \frac{\sum_{g \neq h = 1}^{k} c_{gh}}{k(k-1)s^2}$, the average correlation, computed as the average
covariance divided by the average variance,

and

(3) $v = \frac{\sum_{g=1}^{k}(M_g - M)^2}{k-1}$, the variance of the means,

where

(4) $M = \frac{\sum_{g=1}^{k} M_g}{k}$, the mean of the test means.
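
As an illustration of equations 1 to 4, the four quantities can be computed directly from the test means and the variance-covariance matrix. The sketch below is not part of the original procedure: the function names `determinant` and `basic_statistics` are hypothetical, the determinant is obtained by ordinary Gaussian elimination, and the input figures are the means, standard deviations, and intercorrelations of the three tests used in the illustrative problem of section 6.

```python
def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [[float(x) for x in row] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(a[row][col]))
        if a[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row interchange changes the sign
        det *= a[col][col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for j in range(col, n):
                a[row][j] -= factor * a[col][j]
    return det

def basic_statistics(means, cov):
    """Return D, s^2, r, v, and M for k tests (equations 1 to 4)."""
    k = len(means)
    D = determinant(cov)
    s2 = sum(cov[g][g] for g in range(k)) / k                     # equation 1
    avg_cov = sum(cov[g][h] for g in range(k) for h in range(k)
                  if g != h) / (k * (k - 1))
    r = avg_cov / s2                                              # equation 2
    M = sum(means) / k                                            # equation 4
    v = sum((m - M) ** 2 for m in means) / (k - 1)                # equation 3
    return {"D": D, "s2": s2, "r": r, "v": v, "M": M}

# Data of the illustrative problem in section 6.
means = [27.8, 28.3, 27.9]
sd = [9.9, 10.1, 10.4]
corr = [[1.00, 0.93, 0.92],
        [0.93, 1.00, 0.90],
        [0.92, 0.90, 1.00]]
cov = [[corr[g][h] * sd[g] * sd[h] for h in range(3)] for g in range(3)]

stats = basic_statistics(means, cov)
print(stats)
```

Working from the covariance matrix rather than from separate variance and correlation lists keeps the same function usable for four or more tests, where a determinant of higher order is required.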


3. Statistical criterion for equality of means, equality of 
variances, and equality of covariances 


Following Wilks' notation, we shall use $L_{mvc}$ to designate the statistic
appropriate for testing simultaneously the hypothesis that all means
are equal, all variances are equal, and all covariances are equal.

(5) $L_{mvc} = \frac{D}{s^2[1 + (k-1)r][s^2(1-r) + v]^{k-1}}$.

In the simplest case of three tests, this reduces to

(6) $L_{mvc} = \frac{s_1^2 s_2^2 s_3^2[1 + 2r_{12}r_{13}r_{23} - r_{12}^2 - r_{13}^2 - r_{23}^2]}{s^2(1+2r)(s^2 - s^2 r + v)^2}$.

¹ For the case of three parallel tests, the formulas are given without the use of
determinantal notation. In order to deal with four or more parallel tests, it is
necessary to know how to compute the value of determinants of order 4 or higher.

Small sample tables for evaluating this statistic are given by Wilks
(1946). In the large sample case, according to Wilks, the
statistic $-N \log_e L_{mvc}$ is approximately distributed as chi-square with
(k/2)(k + 3) - 3 degrees of freedom when the hypothesis of equal
means, equal variances, and equal covariances is true. We are
comparing an hypothesis using k means, k variances, and (k/2)(k - 1)
covariances, a total of 2k + (k/2)(k - 1) parameters, with an hypothesis
using only three parameters; hence the degrees of freedom will be

2k + (k/2)(k - 1) - 3 = (k/2)(k + 3) - 3.

For three tests, this reduces to 9 - 3 = 6 degrees of freedom. The
statistic $L_{mvc}$ varies between zero and unity. If the means are identical
in the sample, the variances identical, and the covariances identical, $L_{mvc}$
equals one. As $L_{mvc}$ approaches one, the quantity $-N \log L_{mvc}$
approaches zero. The accompanying table gives the 5 per cent and 1 per
cent points so that, if the quantity $-N \log_{10} L_{mvc}$ calculated from a given
set of data is less than the value given in the 5 per cent column for the
appropriate number of tests (k), we may consider that the tests are
parallel.¹ If the value of $-N \log_{10} L_{mvc}$ from the data is greater than
that in the 1 per cent column for appropriate k, there is less than one
chance in a hundred that such a sample would be drawn from a
population in which means were equal, variances were equal, and covariances
were equal. Under such circumstances we should conclude that the
tests were not parallel in all respects.
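
The parameter count behind the degrees of freedom, and the large-sample statistic itself, are simple enough to express in a few lines. This is a minimal sketch; the function names `dof_mvc` and `large_sample_statistic` are illustrative and do not come from Wilks.

```python
import math

def dof_mvc(k):
    """Degrees of freedom for the L_mvc test with k parallel tests.

    The general hypothesis uses k means, k variances, and k(k - 1)/2
    covariances; the restricted hypothesis uses three parameters.
    """
    general = 2 * k + k * (k - 1) // 2
    restricted = 3
    return general - restricted

def large_sample_statistic(N, L, base10=True):
    """Return -N log L, in common logarithms by default (as in Table 1)."""
    log = math.log10 if base10 else math.log
    return -N * log(L)

for k in range(2, 7):
    print(k, dof_mvc(k))
```

For k = 3 the count is 9 - 3 = 6, as stated in the text, and the statistic is zero when the criterion equals one.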

¹ Table 1 is given in terms of common logarithms (that is, to the base 10) instead
of in terms of the natural logarithms (to the base e), since extensive tables of common
logarithms are more generally available than tables of the natural logarithms.

If $L_{mvc}$ is sufficiently near unity to support the hypothesis that the
means are identical, the variances are identical, and the covariances are
identical, the population is characterized by one common mean, one
common variance, and a common correlation (in this case reliability)
coefficient. Using the subscript zero to indicate the best estimates of
these parameters, we have

(7) $M_0 = M$,

(8) $s_0^2 = s^2 + \frac{v(k-1)}{k}$,

and

(9) $r_0 = \frac{s^2 r - \frac{v}{k}}{s_0^2}$,

where k is the number of parallel tests and the other terms are defined
by equations 1, 2, 3, and 4.

It should be noted that the use commonly made of chi-square and other 
significance tests is to test an hypothesis that the experimenter hopes 
is incorrect. The term “significant difference” means that the data 
diverge significantly from what would be expected in view of the hy- 
pothesis being tested. In other words, the experimenter tests the
hypothesis that “A = B” while arranging the experimental conditions
to the best of his ability so that A ≠ B. The use of the criterion for
parallel tests is an instance of testing an hypothesis that the investigator 
hopes will be verified. Since considerable effort has been expended to 
select items and establish norms so that the tests will be parallel, we 
hope that the means, variances, and covariances will be about equal. 
Therefore what we hope to find in this test is what would commonly 
be called an “insignificant difference.” 

Whenever the ideational structure of any scientific field has developed 
sufficiently, investigators will be testing hypotheses that they believe 
are true; hence they will be hoping to find insignificant differences be- 
tween the data they get and those to be expected from the hypothesis. 
The current search by psychologists for significant differences is merely a 
concomitant of the fact that they have no precise hypotheses that can 
be tested; hence typically the investigator does his best to shape condi- 
tions so that groups A and B will be different, and yet tests the hypothesis 
that “A = B,” hoping to find that it is not adequate for the data. Only
rarely do we find the next step, of devising an hypothesis that pre- 
sumably fits the data, testing this hypothesis, and finding a “non- 
significant difference,” which indicates that an acceptable hypothesis has 
been found. 


4. Criterion for equality of variances and equality of covariances 


If, on testing for the hypothesis that means, variances, and covariances
are equal, we find a significant difference (a small value of $L_{mvc}$ or a
relatively large value of $-N \log_e L_{mvc}$), there would then be some interest
in determining whether or not that difference is attributable solely to
differences in means. If it can be shown that the variances are equal
and that the covariances are equal, while the means are different, it is
easy to adjust test scores (by adding or subtracting a suitable constant)
so that the means of the adjusted scores will be equal. It should be
noted that a test for equality of means cannot be made on data from
the group of subjects used to compute the adjustment. However, if
the norms established from one group are used to adjust test scores for a
second group, the test for equality of means of these converted scores
could be made on the second group. In order to determine whether the
difficulty is with the means alone or with variances and covariances, or
both, two other statistics are of interest, one for testing equality of
variances and equality of covariances and another for testing equality of
means. The statistic ($L_{vc}$) for testing equality of variances and equality
of covariances is like ($L_{mvc}$) given in section 3 except that the term v is
omitted from the denominator:

(10) $L_{vc} = \frac{D}{s^2[1 + (k-1)r][s^2(1-r)]^{k-1}}$.

When there are three tests, we have

(11) $L_{vc} = \frac{s_1^2 s_2^2 s_3^2[1 + 2r_{12}r_{13}r_{23} - r_{12}^2 - r_{13}^2 - r_{23}^2]}{s^6(1+2r)(1-r)^2}$.


The quantity $-N \log_e L_{vc}$ is approximately distributed for large
samples according to the chi-square law with (k/2)(k + 1) - 2 degrees
of freedom, when the hypothesis is true. For three tests, $-N \log_e L_{vc}$
is approximately distributed for large samples according to the
chi-square law with four degrees of freedom. $L_{vc}$ varies between zero and
one. As the variances become more alike and the covariances become
more alike, the value of $L_{vc}$ approaches unity, or the value of $-N \log L_{vc}$
approaches zero. If the value of $-N \log_{10} L_{vc}$ is smaller than that
indicated in the 5 per cent column for appropriate k, we may conclude
that the tests are parallel except possibly for differences in means. If
the value of $-N \log_{10} L_{vc}$ is larger than that indicated in the 1 per cent
column for appropriate k, the tests are not parallel as far as variances
and covariances are concerned.

If the test with $L_{mvc}$ indicated that the tests could not be regarded as
parallel, whereas the test with $L_{vc}$ showed that the tests were parallel
as far as variances and covariances were concerned, the population
represented by the data is characterized by k + 2 parameters. These
are the k means, one variance, and one reliability coefficient. The best
estimates of these parameters are, respectively, the test means ($M_g$),
the average variance, $s^2$ (see equation 1), and the average correlation r,
as given in equation 2.

If, after finding a “significant” $L_{mvc}$, we also find a significant $L_{vc}$,
this shows that equating the means still would not make the tests
parallel. Either the variances, the covariances, or both are significantly
different. It would be desirable next to test covariances and variances
separately for significance, for, if the covariances are not significantly
different, both means and variances can be brought into line by an
appropriate linear transformation. However, if the covariances are
significantly different, it is not possible to set up norms that will “equate”
the tests. Unfortunately, since it is not yet possible to test the
significance of the difference of covariances independently of similarities or
differences in variances, for the present this step must be omitted. If
we have two samples to which the tests have been given, it is possible
to make the equivalent of a test of covariances independently of
variances by the following procedure. Compute the standard deviations of
both forms ($s_x$ and $s_y$) for the first sample. If the y-scores are multiplied
by $s_x/s_y$, the standard deviation of the y-scores will be identical with
that of the x-scores for that same sample. Now regardless of the
standard deviations of x and y in the second sample, multiply all the
y-scores by the multiplier ($s_x/s_y$) determined from the first sample. The
test with $L_{vc}$ may then be made, using the x-scores and the transformed
y-scores both from the second sample. If the test with $L_{vc}$ indicates that the
forms are parallel, we may conclude that, if the multiplier $s_x/s_y$ is used
for the y-scores, the forms are parallel. If the test with $L_{vc}$ shows that
the forms are not parallel, the difficulty probably lies with the
covariances, since standard deviations were equated.

Suppose, on the other hand, that, after finding a significant $L_{mvc}$, we
find homogeneity when testing with $L_{vc}$. It is then reasonable to suppose
that the difference in means was responsible for the heterogeneity shown
by the test with $L_{mvc}$. If subsequently the test by $L_m$ shows that the means
are heterogeneous, we have a consistent set of results and can conclude
that the variances and covariances are equal, and that the heterogeneity
of the means is the reason for failure to show homogeneity when testing
with $L_{mvc}$. If, after failing to find a sufficiently large value of
$L_{mvc}$, we find large values for both $L_{vc}$ and $L_m$, the results are
inconsistent. In this sort of instance (which is not impossible) we are dealing
with some peculiar borderline case in which the means alone or the
variances and covariances alone show homogeneity. With the increased
degrees of freedom for testing the more comprehensive hypothesis,
however, we find a significant difference. In such a case the only
conclusion we can reach is that the tests are not parallel tests, but that the
difficulty is not clearly indicated as due to mean differences or to
variance-covariance differences.


5. Statistical criterion for equality of means 
For testing equality of means (if $L_{vc}$ is nonsignificant), we use

(12) $L_m = \frac{s^2(1-r)}{s^2(1-r) + v}$.

The quantity $-N(k-1)\log_e L_m$ is distributed approximately as
chi-square with k - 1 degrees of freedom for large samples when the
hypothesis is true. The value of $L_m$ is unity if the sample means are
identical, and approaches zero as the sample means diverge. The
quantity $-N(k-1)\log L_m$ is zero if the sample means are identical
and increases as the sample means diverge. The 5 per cent and 1 per
cent points for $-N(k-1)\log_{10} L_m$ are given in the last two columns
of Table 1.

Wilks (1946) has shown that

(13) $L_{mvc} = L_{vc} L_m^{k-1}$.

This relationship may be used as a partial arithmetical check when all
three values are computed.

We can also conclude from equation 13 and from Table 1 that, if
$L_{vc}$ and $L_m$ are each small enough to give $-N \log_{10} L$ values above the
5 per cent or the 1 per cent points, the value of $L_{mvc}$ will be small enough
to give a value of $-N \log_{10} L_{mvc}$ that will be above the 5 per cent or
1 per cent point, as the case may be. It will be noted that the 5 per
cent and 1 per cent points for $-N \log_{10} L_{mvc}$ are in each case less than
the sum of the two corresponding values for $-N \log_{10} L_{vc}$ and
$-N(k-1)\log_{10} L_m$. That is, if the means tested by themselves are
significantly different and if also the variance-covariance matrix tested
by itself shows a significant difference at the 1 per cent or 5 per cent
level, the test with $L_{mvc}$ must show a significant difference unless errors
were made in the computations.

If $L_{mvc}$ shows that the tests may be regarded as parallel, whereas one
and only one of the other tests (either $L_{vc}$ or $L_m$) indicates a significant
difference, again we are dealing with a perfectly possible but borderline
case, and must conclude that the tests are not parallel either with respect
to means or with respect to variances and covariances.


TABLE 1

APPROXIMATE 5 PER CENT AND 1 PER CENT POINTS FOR $-N \log_{10} L_{mvc}$, $-N \log_{10} L_{vc}$,
AND $-N(k-1)\log_{10} L_m$ FOR k = 2, 3, 4, 5, 6

      -N log10 L_mvc               -N log10 L_vc                -N(k - 1) log10 L_m
k   d.f.  5 per cent  1 per cent   d.f.  5 per cent  1 per cent   d.f.  5 per cent  1 per cent
2     2     2.60206     4.00000      1     1.66832     2.88150      1     1.66832     2.88150
3     6     5.4685      7.3013       4     4.12047     5.7660       2     2.60206     4.00000
4    11     8.5448     10.7379       8     6.7347      8.7251       3     3.39389     4.9270
5    17    11.9809     14.5092      13     9.7117     12.0249       4     4.12047     5.7660
6    24    15.8149     18.6659      19    13.0912     15.7175       5     4.8079      6.5519

Adapted from Wilks (1946), page 266.

If N ≥ 100, this table is sufficiently accurate. If N < 100, see Wilks (1946)
for a detailed statement of the accuracy of this table, and for small sample methods.
Note that the entries in this table are given in terms of logarithms to the base 10.
Hence these entries are 0.43429 times the entries in Wilks (1946), page 266.
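
The factor 0.43429 is the common logarithm of e, and the conversion can be verified for the rows with 2 degrees of freedom without consulting a chi-square table. The sketch below relies on one standard fact that is not stated in the text: for 2 degrees of freedom the upper tail area of chi-square beyond x is exp(-x/2), so the point cutting off a proportion alpha is -2 log_e alpha. The function name is illustrative.

```python
import math

LOG10_E = math.log10(math.e)  # 0.43429..., the conversion factor of Table 1

def chi_square_point_2df(alpha):
    """Upper-tail critical value of chi-square with 2 degrees of freedom."""
    return -2.0 * math.log(alpha)

five_per_cent = LOG10_E * chi_square_point_2df(0.05)
one_per_cent = LOG10_E * chi_square_point_2df(0.01)

print(round(LOG10_E, 5), round(five_per_cent, 5), round(one_per_cent, 5))
```

The two results agree with the entries 2.60206 and 4.00000 that Table 1 gives wherever the degrees of freedom equal 2. Entries for other degrees of freedom need tabled chi-square values and are not reproduced by this closed form.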


6. Illustrative problem for hypotheses $H_{mvc}$, $H_{vc}$, and $H_m$


The computation of the three criteria is illustrated in the following
example. Three parallel tests (1, 2, and 3) are given to 130 subjects.

                          1        2        3
Mean                    27.8     28.3     27.9
Standard deviation       9.9     10.1     10.4

r12 = .93      r13 = .92      r23 = .90

          M_g     (M_g - M)^2     s_g      s_g^2      r_gh s_g s_h
         27.8        .04          9.9      98.01        92.9907
         28.3        .09         10.1     102.01        94.7232
         27.9        .01         10.4     108.16        94.5360
Sum      84.0        .14                  308.18       282.2499
Mean     28.0       (.07)* = v            102.73 = s^2  94.0833 = s^2 r

* Divided by 2. The other sums are divided by 3.

$D = s_1^2 s_2^2 s_3^2(1 + 2r_{12}r_{13}r_{23} - r_{12}^2 - r_{13}^2 - r_{23}^2) = 20{,}308.3859$.

$s^2(1-r) = 8.6467$,  $[s^2(1-r)]^2 = 74.7654$,  $s^2(1+2r) = 290.8967$.

$s^2(1-r) + v = 8.7167$,  $[s^2(1-r) + v]^2 = 75.9809$.

$L_{mvc} = \frac{D}{s^2(1+2r)[s^2(1-r) + v]^2} = .9188$.

$-N \log_{10} L_{mvc} = 4.78$.

$L_{vc} = \frac{D}{s^2(1+2r)[s^2(1-r)]^2} = .9338$.

$-N \log_{10} L_{vc} = 3.87$.

$L_m = \frac{s^2(1-r)}{s^2(1-r) + v} = .9920$.

$-N(k-1)\log_{10} L_m = 0.91$.


By reference to Table 1 it is clear that all three criteria show the three
tests to be parallel. The data are in agreement with the hypothesis
that the means are equal, the variances are equal, and the covariances
are equal.
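
The arithmetic of this illustrative problem can be retraced with a short script. This is a sketch for checking purposes only; the variable names are illustrative, and the three-test form of D from equation 6 is used so that no determinant routine is needed. Because the script carries the average variance at full precision, while the printed computation appears to round it to 102.73, the results agree with the printed figures only to about the second decimal place.

```python
import math

N = 130
k = 3
means = [27.8, 28.3, 27.9]
sd = [9.9, 10.1, 10.4]
r12, r13, r23 = 0.93, 0.92, 0.90

variances = [s * s for s in sd]
s2 = sum(variances) / k                                   # equation 1
avg_cov = (r12 * sd[0] * sd[1] + r13 * sd[0] * sd[2]
           + r23 * sd[1] * sd[2]) / 3                     # s^2 r, from equation 2
M = sum(means) / k                                        # equation 4
v = sum((m - M) ** 2 for m in means) / (k - 1)            # equation 3

# Determinant of the variance-covariance matrix for three tests.
D = (variances[0] * variances[1] * variances[2]
     * (1 + 2 * r12 * r13 * r23 - r12 ** 2 - r13 ** 2 - r23 ** 2))

within = s2 - avg_cov              # s^2 (1 - r)
common = s2 + (k - 1) * avg_cov    # s^2 [1 + (k - 1) r]

L_mvc = D / (common * (within + v) ** (k - 1))            # equation 5
L_vc = D / (common * within ** (k - 1))                   # equation 10
L_m = within / (within + v)                               # equation 12

stat_mvc = -N * math.log10(L_mvc)
stat_vc = -N * math.log10(L_vc)
stat_m = -N * (k - 1) * math.log10(L_m)

print(round(L_mvc, 4), round(L_vc, 4), round(L_m, 4))
print(round(stat_mvc, 2), round(stat_vc, 2), round(stat_m, 2))
```

The product check of equation 13 holds for these values, and each of the three statistics falls below the corresponding 5 per cent point of Table 1 for k = 3 (5.4685, 4.12047, and 2.60206), in agreement with the conclusion stated above.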


7. Hypotheses of compound symmetry 

The criterion devised by Wilks (1946) applies only to means,
variances, and covariances of parallel tests. In addition, parallel tests
should have equal validities for predicting any criterion. The statistical
criteria for “compound symmetry” presented by Votaw (1948) include a
statistical test for equal validities of a set of parallel tests. We shall
present here only a restricted case of one form of compound symmetry,
the case where we are interested in two sets of parallel tests ($x_g$ and $z_g$)
and the criteria $y_g$. Let us say that there are k parallel tests in the
x-set, f parallel tests in the z-set, and b different criteria to be predicted.
This set of b + k + f tests given to N persons results in a
variance-covariance matrix and its determinant D given in equation 14.
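Equation (14)

$$
D = \begin{vmatrix}
s_{y_1}^2 & c_{y_1y_2} & \cdots & c_{y_1y_b} & c_{y_1x_1} & c_{y_1x_2} & \cdots & c_{y_1x_k} & c_{y_1z_1} & c_{y_1z_2} & \cdots & c_{y_1z_f} \\
c_{y_2y_1} & s_{y_2}^2 & \cdots & c_{y_2y_b} & c_{y_2x_1} & c_{y_2x_2} & \cdots & c_{y_2x_k} & c_{y_2z_1} & c_{y_2z_2} & \cdots & c_{y_2z_f} \\
\vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\
c_{y_by_1} & c_{y_by_2} & \cdots & s_{y_b}^2 & c_{y_bx_1} & c_{y_bx_2} & \cdots & c_{y_bx_k} & c_{y_bz_1} & c_{y_bz_2} & \cdots & c_{y_bz_f} \\
c_{x_1y_1} & c_{x_1y_2} & \cdots & c_{x_1y_b} & s_{x_1}^2 & c_{x_1x_2} & \cdots & c_{x_1x_k} & c_{x_1z_1} & c_{x_1z_2} & \cdots & c_{x_1z_f} \\
c_{x_2y_1} & c_{x_2y_2} & \cdots & c_{x_2y_b} & c_{x_2x_1} & s_{x_2}^2 & \cdots & c_{x_2x_k} & c_{x_2z_1} & c_{x_2z_2} & \cdots & c_{x_2z_f} \\
\vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\
c_{x_ky_1} & c_{x_ky_2} & \cdots & c_{x_ky_b} & c_{x_kx_1} & c_{x_kx_2} & \cdots & s_{x_k}^2 & c_{x_kz_1} & c_{x_kz_2} & \cdots & c_{x_kz_f} \\
c_{z_1y_1} & c_{z_1y_2} & \cdots & c_{z_1y_b} & c_{z_1x_1} & c_{z_1x_2} & \cdots & c_{z_1x_k} & s_{z_1}^2 & c_{z_1z_2} & \cdots & c_{z_1z_f} \\
c_{z_2y_1} & c_{z_2y_2} & \cdots & c_{z_2y_b} & c_{z_2x_1} & c_{z_2x_2} & \cdots & c_{z_2x_k} & c_{z_2z_1} & s_{z_2}^2 & \cdots & c_{z_2z_f} \\
\vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\
c_{z_fy_1} & c_{z_fy_2} & \cdots & c_{z_fy_b} & c_{z_fx_1} & c_{z_fx_2} & \cdots & c_{z_fx_k} & c_{z_fz_1} & c_{z_fz_2} & \cdots & s_{z_f}^2
\end{vmatrix}
$$

where $y_g$, $y_h$ (g, h = 1, 2, ... b) designates the criterion variables,
$x_g$, $x_h$ (g, h = 1, 2, ... k) designates the parallel tests of the x-set,
$z_g$, $z_h$ (g, h = 1, 2, ... f) designates the parallel tests of the z-set,
s designates a standard deviation,
c designates a covariance term,
b designates the number of criterion variables,
k designates the number of parallel tests in the x-set,
f designates the number of parallel tests in the z-set.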


We shall consider three hypotheses ($\bar{H}_{mvc}$, $\bar{H}_{vc}$, and $\bar{H}_m$) regarding the
relationships among the set of tests designated in equation 14.

Let $\bar{H}_{mvc}$ designate the hypothesis that, for each set of parallel tests,
the population means are equal, the population variances are equal, the
population covariances are equal, and the population covariances with
any single criterion variable are equal; between any two sets of parallel
tests, the population covariances are equal. For the case of two sets of
parallel tests and b different criterion variables that is indicated in
equation 14, let $\mu_{y_g}$, $\mu_{x_g}$, and $\mu_{z_g}$ designate population means, $\sigma_{y_g}^2$, $\sigma_{x_g}^2$,
and $\sigma_{z_g}^2$ designate population variances, and let $\zeta$ with appropriate
subscripts designate a population covariance. In terms of this notation,
hypothesis $\bar{H}_{mvc}$ asserts that:

1. All $\mu_{x_g}$ (g = 1, 2, ... k) are equal.

2. All $\mu_{z_g}$ (g = 1, 2, ... f) are equal.

3. All $\sigma_{x_g}^2$ (g = 1, 2, ... k) are equal.

4. All $\sigma_{z_g}^2$ (g = 1, 2, ... f) are equal.

5. All $\zeta_{x_gx_h}$ (g ≠ h = 1, 2, ... k) are equal.

6. All $\zeta_{z_gz_h}$ (g ≠ h = 1, 2, ... f) are equal.

7. All $\zeta_{x_gz_h}$ (g = 1, 2, ... k; h = 1, 2, ... f) are equal.

8. For any fixed value of h (h = 1, 2, ... b), all $\zeta_{y_hx_g}$ (g = 1, 2, ... k)
are equal.

9. For any fixed value of h (h = 1, 2, ... b), all $\zeta_{y_hz_g}$ (g = 1, 2, ... f)
are equal.


Let $\bar{H}_{vc}$ designate the hypothesis that, for each set of parallel tests,
the population variances are equal, the population covariances are
equal, and the population covariances with any single criterion variable
are equal; between any two sets of parallel tests, the population
covariances are equal. This hypothesis is identical with $\bar{H}_{mvc}$, except that no
restrictions are imposed on the means.

Let $\bar{H}_m$ designate the hypothesis that, for each set of parallel tests,
the population means are equal, given that $\bar{H}_{vc}$ is true.


8. Basic statistics needed for tests of compound symmetry 

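In addition to the determinant of equation 14, and the variances and
covariances indicated in its elements, the following quantities are needed
for the tests of compound symmetry represented by hypotheses $\bar{H}_{mvc}$,
$\bar{H}_{vc}$, and $\bar{H}_m$.

The mean for each of the k + f predictor tests and the grand mean for
each of the two sets is needed as follows:

(15) $\bar{X}_{.g} = \frac{\sum_{i=1}^{N} X_{ig}}{N}$,

(16) $\bar{X}_{..} = \frac{\sum_{g=1}^{k}\sum_{i=1}^{N} X_{ig}}{kN}$,

(17) $\bar{Z}_{.g} = \frac{\sum_{i=1}^{N} Z_{ig}}{N}$,

(18) $\bar{Z}_{..} = \frac{\sum_{g=1}^{f}\sum_{i=1}^{N} Z_{ig}}{fN}$.

The variance of each set of means is also needed:

(19) $v_x = \frac{\sum_{g=1}^{k}(\bar{X}_{.g} - \bar{X}_{..})^2}{k-1}$,

(20) $v_z = \frac{\sum_{g=1}^{f}(\bar{Z}_{.g} - \bar{Z}_{..})^2}{f-1}$.

The averages for certain sets of variances and covariances are needed,
as follows:

(21) $\bar{c}_{y_gx} = \frac{\sum_{h=1}^{k} c_{y_gx_h}}{k}$,

(22) $\bar{c}_{y_gz} = \frac{\sum_{h=1}^{f} c_{y_gz_h}}{f}$,

(23) $u_x = \frac{\sum_{h=1}^{k} s_{x_h}^2}{k}$,

(24) $u_z = \frac{\sum_{h=1}^{f} s_{z_h}^2}{f}$,

(25) $w_x = \frac{\sum_{g \neq h = 1}^{k} c_{x_gx_h}}{k(k-1)}$,

(26) $w_z = \frac{\sum_{g \neq h = 1}^{f} c_{z_gz_h}}{f(f-1)}$,

(27) $\bar{c}_{xz} = \frac{\sum_{g=1}^{k}\sum_{h=1}^{f} c_{x_gz_h}}{kf}$.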
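Using the matrix of criterion intercorrelations from equation 14, and
the averages defined in equations 21 to 27, we define the determinant B,
of order b + 2, as follows:

Equation (28)

$$
B = \begin{vmatrix}
s_{y_1}^2 & c_{y_1y_2} & \cdots & c_{y_1y_b} & \bar{c}_{y_1x}\sqrt{k} & \bar{c}_{y_1z}\sqrt{f} \\
c_{y_2y_1} & s_{y_2}^2 & \cdots & c_{y_2y_b} & \bar{c}_{y_2x}\sqrt{k} & \bar{c}_{y_2z}\sqrt{f} \\
\vdots & \vdots & & \vdots & \vdots & \vdots \\
c_{y_by_1} & c_{y_by_2} & \cdots & s_{y_b}^2 & \bar{c}_{y_bx}\sqrt{k} & \bar{c}_{y_bz}\sqrt{f} \\
\bar{c}_{y_1x}\sqrt{k} & \bar{c}_{y_2x}\sqrt{k} & \cdots & \bar{c}_{y_bx}\sqrt{k} & u_x + (k-1)w_x & \bar{c}_{xz}\sqrt{kf} \\
\bar{c}_{y_1z}\sqrt{f} & \bar{c}_{y_2z}\sqrt{f} & \cdots & \bar{c}_{y_bz}\sqrt{f} & \bar{c}_{xz}\sqrt{kf} & u_z + (f-1)w_z
\end{vmatrix}
$$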

9. The criterion for hypothesis fL, 
The sample criterion for Ayre is given by 
D 
Bjus — we + v] [u; — w: + vj 
If N is large and Ĥ mvc is true, the quantity —N log, Ln ye is distributed 
k+f 
g (Ff 3) + ok +f—2) m 
degrees of freedom. If only one set of parallel tests is available, for the 
test with Lj; we have (k/2)(k + 3) + b(k — 1) — 3 degrees of freedom. 
The general formulation for any number of sets of parallel tests with N 


large is given by Votaw (1948), page 467. 
For the special case of two parallel tests designated by subscripts 


1 and 2 and a single criterion variable (y), we have 


(29) Ln vc = 


approximately as chi-square with 


2 FAROA 2 19 
84781789" (1 + 2rytryori2 — My” — ry? — ns?) 


830) Lnve = [su + w) — 26,2][u — w + y] 


where u — (si? + 85?)/2, 
Ww = "128182; 
(X, — X3? 
a 2 
ke Cyri + Cyn. 
XE 


v E 
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If N is large and An». is true, the quantity —N log, Ĉmve is distributed 
approximately as chi-square with 5 + 1 — 3 = 3 degrees of freedom. 

When $L_{mvc}$ is sufficiently near unity to support the hypothesis $H_{mvc}$,
the best estimates (shown below by 0 subscripts) of the common parameters
indicated in conditions 1 to 9 for $H_{mvc}$ are as follows.

(31)  $\bar{X}_0 = \bar{X}_{..}$  (see equation 16),

(32)  $\bar{Z}_0 = \bar{Z}_{..}$  (see equation 18),

(33)  $s_{x0}^2 = u_x + \dfrac{v_x(k - 1)}{k}$  (see equations 19 and 23),

(34)  $s_{z0}^2 = u_z + \dfrac{v_z(f - 1)}{f}$  (see equations 20 and 24),

(35)  $r_{xx0} = \dfrac{w_x - v_x/k}{s_{x0}^2}$  (see equations 19 and 25),

(36)  $r_{zz0} = \dfrac{w_z - v_z/f}{s_{z0}^2}$  (see equations 20 and 26),

(37)  $r_{xz0} = \dfrac{\bar{c}_{xz}}{s_{x0}s_{z0}}$  (see equation 27),

(38)  $r_{y_i x0} = \dfrac{\bar{c}_{y_i x}}{s_{y_i}s_{x0}}$  (see equation 21),

(39)  $r_{y_i z0} = \dfrac{\bar{c}_{y_i z}}{s_{y_i}s_{z0}}$  (see equation 22).
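
As a purely arithmetical sketch of equations 33, 35, and 38, the following Python fragment computes the pooled estimates for a single set of k tests and one criterion. The function name and the argument layout are assumptions made for illustration only. The figures are those of the illustrative problem of Section 12 and serve merely to exercise the formulas, since the estimates would be adopted in practice only when $H_{mvc}$ is supported.

```python
import math
from itertools import combinations


def pooled_estimates(means, sds, r_xx, r_yx, sd_y):
    """Pooled variance, reliability, and validity for one set of k tests.

    means, sds : per-test means and standard deviations
    r_xx       : dict mapping a pair (g, h), g < h, to the test intercorrelation
    r_yx       : per-test correlation with the criterion
    sd_y       : standard deviation of the criterion
    """
    k = len(means)
    u = sum(s ** 2 for s in sds) / k                               # equation 23
    pairs = list(combinations(range(k), 2))
    w = sum(r_xx[p] * sds[p[0]] * sds[p[1]] for p in pairs) / len(pairs)  # eq. 25
    grand = sum(means) / k
    v = sum((m - grand) ** 2 for m in means) / (k - 1)            # equation 19
    c_bar = sum(r * sd_y * s for r, s in zip(r_yx, sds)) / k      # equation 21

    s2_x0 = u + v * (k - 1) / k                                    # equation 33
    r_xx0 = (w - v / k) / s2_x0                                    # equation 35
    r_yx0 = c_bar / (sd_y * math.sqrt(s2_x0))                      # equation 38
    return {"s2_x0": s2_x0, "r_xx0": r_xx0, "r_yx0": r_yx0}


# Data of the illustrative problem (three tests, one criterion).
estimates = pooled_estimates(
    means=[118.0, 117.0, 119.0],
    sds=[10.0, 9.0, 12.0],
    r_xx={(0, 1): 0.88, (0, 2): 0.92, (1, 2): 0.90},
    r_yx=[0.64, 0.66, 0.65],
    sd_y=21.0,
)
print(estimates)
```

Averaging over the pairs g < h gives the same value of $w_x$ as averaging over all g ≠ h, since the covariances are symmetric.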


10. The criterion for hypothesis $H_{vc}$

If the value of $L_{mvc}$ is small (that is, the value of $-N \log_e L_{mvc}$ is
large), $H_{mvc}$ cannot be accepted. In this case we may wish to see if
the differences in means of the parallel tests account for the failure to
satisfy $H_{mvc}$. In order to do this we next investigate hypothesis $H_{vc}$.
For this test the sample criterion is taken as

(40)  $L_{vc} = \dfrac{D}{B\,[u_x - w_x]^{k-1}\,[u_z - w_z]^{f-1}}$.

If N is large and $H_{vc}$ is true, the quantity $-N \log_e L_{vc}$ is distributed
approximately as chi-square with

$\dfrac{k + f}{2}\,(k + f + 1) + b(k + f - 2) - 5$

degrees of freedom. If only one set of parallel tests is available, for the
test with $L_{vc}$ we have (k/2)(k + 1) + b(k - 1) - 2 degrees of freedom.
The general formulation for any number of sets of parallel tests with N
large is given by Votaw (1948), page 467.

For the special case of one criterion variable (y) and two parallel tests
(designated by subscripts 1 and 2), we have

(41)  $L_{vc} = \dfrac{s_y^2 s_1^2 s_2^2\,[1 + 2r_{y1}r_{y2}r_{12} - r_{y1}^2 - r_{y2}^2 - r_{12}^2]}{[s_y^2(u_x + w_x) - 2\bar{c}_{yx}^2]\,[u_x - w_x]}$,

where $u_x = (s_1^2 + s_2^2)/2$ (see equation 23),
      $w_x = c_{12}$ (see equation 25), and
      $\bar{c}_{yx} = (c_{y1} + c_{y2})/2$ (see equation 21).

In this case, when N is large and $H_{vc}$ is true, the quantity $-N \log_e L_{vc}$
is distributed approximately as chi-square with 3 + 1 - 2 = 2 degrees
of freedom.

If the test with $L_{mvc}$ indicated that the tests could not be regarded as
parallel, whereas the test with $L_{vc}$ indicated that the tests could be
regarded as parallel as far as variances and covariances were concerned,
the population represented by the data is characterized by:

1. A mean for each test, giving k + f + b means, represented by
$\bar{X}_g$, $\bar{Z}_h$, and $\bar{Y}_i$.

2. A variance for each criterion variable ($s_{y_i}^2$) and the two variances
$u_x$ and $u_z$, given by equations 23 and 24.

3. Two reliability coefficients given by $w_x/u_x$ and $w_z/u_z$ (see equations
23 to 26).

4. The intercorrelation $r_{xz}$ given by $\bar{c}_{xz}/\sqrt{u_x u_z}$ (see equations 23,
24, and 27).

5. Two validity coefficients for each $y_i$, given by $\bar{c}_{y_i x}/(s_{y_i}\sqrt{u_x})$ and
$\bar{c}_{y_i z}/(s_{y_i}\sqrt{u_z})$ (see equations 21 and 22).


11. The criterion for hypothesis $H_m$

If the test with $L_{mvc}$ has shown “significant” differences, whereas the
test with $L_{vc}$ substantiates hypothesis $H_{vc}$, the presumption would be
that the tests might be regarded as parallel except for the values of the
means. If $H_{vc}$ is true, it is possible to test the means directly. The
criterion for hypothesis $H_m$ (assuming $H_{vc}$) is

(42)  $L_m = \dfrac{(u_x - w_x)^{k-1}\,(u_z - w_z)^{f-1}}{(u_x - w_x + v_x)^{k-1}\,(u_z - w_z + v_z)^{f-1}}$.

If N is large and $H_m$ is true, the quantity $-N \log_e L_m$ is distributed
approximately as chi-square with k + f - 2 degrees of freedom.
If only one set of parallel tests is used, for the test with $L_m$ we have
k - 1 degrees of freedom. The general formulation for any number
of sets of parallel tests with N large is given by Votaw (1948), page 467.
As an arithmetical check it should be noted that

(43)  $\log L_{vc} + \log L_m = \log L_{mvc}$.
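
The degrees of freedom for the three tests can be collected in a small helper, sketched below in Python. The one-set expressions are those stated in the text. The constants used for two sets (-7 and -5) are assumptions here, obtained by counting the free parameters under each hypothesis, and should be checked against Votaw (1948) before being relied upon. The function name is hypothetical.

```python
def degrees_of_freedom(k, b, f=None):
    """Degrees of freedom for the tests with L_mvc, L_vc, and L_m.

    k : number of parallel tests in the first set
    b : number of criterion variables
    f : number of parallel tests in the second set (None if only one set)
    """
    if f is None:
        # One set of parallel tests, as stated in the text.
        mvc = k * (k + 3) // 2 + b * (k - 1) - 3
        vc = k * (k + 1) // 2 + b * (k - 1) - 2
        m = k - 1
    else:
        # Two sets of parallel tests (constants assumed, see lead-in).
        n = k + f
        mvc = n * (n + 3) // 2 + b * (n - 2) - 7
        vc = n * (n + 1) // 2 + b * (n - 2) - 5
        m = n - 2
    return {"mvc": mvc, "vc": vc, "m": m}


# Three parallel tests and one criterion, as in the illustrative problem.
print(degrees_of_freedom(k=3, b=1))
```

In every case the degrees of freedom for $L_{vc}$ and $L_m$ add to those for $L_{mvc}$, which parallels the arithmetical check of equation 43.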


12. Illustrative problem for compound symmetry 

The computation of the three criteria for compound symmetry is 
illustrated in the following example. Information is available on three 
parallel tests (1, 2, and 3) and a criterion (y) for 100 subjects. The 
correlations, means, and standard deviations are: 


                    y        1        2        3

    y             1.00      .64      .66      .65

    1              .64     1.00      .88      .92

    2              .66      .88     1.00      .90

    3              .65      .92      .90     1.00

    Standard
    deviation      21.      10.       9.      12.
    Mean          191.     118.     117.     119.


The determinant of the correlation matrix is .014 420 64. Multiplying
this by the product of the four variances gives the determinant of
the variance-covariance matrix (D),

$D = 144 \times 441 \times 81 \times 100 \times .01442064 = 7{,}417{,}723.$

The determinant $B = (441)(299.5) - (141\sqrt{3})^2 = 72{,}436.5$,

$(u_x - w_x + v_x)^{k-1} = (108.3 - 95.6 + 1)^2 = 187.69$,

$(u_x - w_x)^{k-1} = (108.3 - 95.6)^2 = 161.29$.

From the foregoing results, we have

$L_{mvc} = \dfrac{7{,}417{,}723}{(72{,}436.5)(187.69)} = .5456$,

$L_{vc} = \dfrac{7{,}417{,}723}{(72{,}436.5)(161.29)} = .6349$,

$L_m = \dfrac{161.29}{187.69} = .8593$.


Table 2 shows the 5 per cent and 1 per cent points for $-N \log_e L$,
which is chi-square, and also the corresponding 5 per cent and 1 per cent
points for $-N \log_{10} L$, which is 0.43429 times the corresponding
chi-square value. These values in terms of logarithms to the base 10 are
given because such tables are usually more readily available; hence some
workers may prefer to use these values.

TABLE 2

APPROXIMATE 5 PER CENT AND 1 PER CENT POINTS FOR $-N \log_e L$, AND ALSO FOR
$-N \log_{10} L$, FOR DEGREES OF FREEDOM 1 TO 30

             -N log_e L                 -N log_10 L
    d.f.   5 per cent   1 per cent   5 per cent   1 per cent
      1       3.84         6.64         1.67         2.88
      2       5.99         9.21         2.60         4.00
      3       7.82        11.34         3.39         4.93
      4       9.49        13.28         4.12         5.77
      5      11.07        15.09         4.81         6.55
      6      12.59        16.81         5.47         7.30
      7      14.07        18.48         6.11         8.02
      8      15.51        20.09         6.73         8.72
      9      16.92        21.67         7.35         9.41
     10      18.31        23.21         7.95        10.08
     11      19.68        24.72         8.54        10.74
     12      21.03        26.22         9.13        11.39
     13      22.36        27.69         9.71        12.03
     14      23.68        29.14        10.28        12.66
     15      25.00        30.58        10.86        13.28
     16      26.30        32.00        11.42        13.90
     17      27.59        33.41        11.98        14.51
     18      28.87        34.80        12.54        15.11
     19      30.14        36.19        13.09        15.72
     20      31.41        37.57        13.64        16.32
     21      32.67        38.93        14.19        16.91
     22      33.92        40.29        14.73        17.50
     23      35.17        41.64        15.27        18.08
     24      36.42        42.98        15.82        18.67
     25      37.65        44.31        16.35        19.24
     26      38.88        45.64        16.89        19.82
     27      40.11        46.96        17.42        20.39
     28      41.34        48.28        17.95        20.97
     29      42.56        49.59        18.48        21.54
     30      43.77        50.89        19.01        22.10

For d.f. larger than 30,
    $x = \sqrt{2\chi^2} - \sqrt{2(\text{d.f.}) - 1}$
is distributed approximately as the unit normal curve. For x the 5 per cent point is
1.645; the 1 per cent point is 2.326.
The values in columns 2 and 3 are chi-square values; those in columns 4 and 5 are
0.43429 times the corresponding chi-square value.

For this illustrative problem, we have

    $-N \log_e L_{mvc} = 60.7$    $-N \log_{10} L_{mvc} = 26.3$    d.f. = 8,
    $-N \log_e L_{vc} = 45.4$     $-N \log_{10} L_{vc} = 19.7$     d.f. = 6,
    $-N \log_e L_m = 15.2$        $-N \log_{10} L_m = 6.59$        d.f. = 2.


By reference to Table 2 we see that these values are considerably 
larger than the 1 per cent point values for the corresponding degrees of 
freedom. The values are clearly significant at the 1 per cent level, that 
is, these three tests cannot be regarded as parallel tests for predicting 
criterion y. 
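
The comparison with Table 2 can be sketched in Python as follows. The observed values (60.7, 45.4, and 15.2) and the 1 per cent points for 8, 6, and 2 degrees of freedom are copied from the text and the table. The function names are hypothetical, and the approximation for more than 30 degrees of freedom is the one given in the note to Table 2.

```python
import math

LOG10_E = 0.43429  # factor converting natural logarithms to base-10 logarithms


def to_base_10(chi_square_value):
    """Convert a value of -N log_e L into the corresponding -N log_10 L."""
    return LOG10_E * chi_square_value


def normal_deviate(chi_square_value, df):
    """Approximate unit normal deviate for chi-square with df larger than 30."""
    return math.sqrt(2.0 * chi_square_value) - math.sqrt(2.0 * df - 1.0)


# 1 per cent points of chi-square copied from Table 2.
ONE_PER_CENT = {2: 9.21, 6: 16.81, 8: 20.09}

# Observed -N log_e L and degrees of freedom for the illustrative problem.
OBSERVED = {"L_mvc": (60.7, 8), "L_vc": (45.4, 6), "L_m": (15.2, 2)}

significant = {
    name: value > ONE_PER_CENT[df] for name, (value, df) in OBSERVED.items()
}
for name, (value, df) in OBSERVED.items():
    print(name, df, value, round(to_base_10(value), 2), significant[name])
```

A criterion is judged significant when the observed value exceeds the tabled point, so all three entries of `significant` being true corresponds to the conclusion drawn above.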


13. Summary 


The statistic $L_{mvc}$ given in equations 5 and 6 is used to test
simultaneously for equality of means, variances, and covariances. If they
are equal, the best estimate of each is given in equations 7, 8, and 9.

The statistic $L_{vc}$ given in equations 10 and 11 is used to test
simultaneously for equality of variances and covariances. If these are equal,
the best estimate of each is given by equations 1 and 2.

The statistic $L_m$ given in equation 12 is used to test for equality of
means (assuming equality of variances and covariances).

The tables of the 5 per cent and 1 per cent points are given in terms
of $-N \log_{10} L_{mvc}$, $-N \log_{10} L_{vc}$, and $-N(k - 1) \log_{10} L_m$. If the
value computed from the data is greater than the one found in the table,
the tests cannot be regarded as parallel. If it is less than the one found
in the table, the indication is that the tests may be regarded as parallel.

If one or more of the three statistics $L_{mvc}$, $L_{vc}$, and $L_m$ show a
significant difference, we must conclude that the tests are not strictly parallel.
There is only one combination of results that is impossible. It is not
possible that $L_{mvc}$ indicates a non-significant difference when $L_{vc}$ and
$L_m$ each indicate a significant difference at the 1 per cent or 5 per cent
level. The tests can be regarded as parallel only when each of the three
statistics considered separately shows a non-significant difference.

The more general case of compound symmetry that includes equality
of validity coefficients and equality of correlations between two sets
of parallel tests was also presented. Equations for the three criteria
$L_{mvc}$, $L_{vc}$, and $L_m$ are presented in equations 29, 40, and 42, respectively.

Problems


1. Three comparable forms of a test are given to 200 persons with the following
results:

    $M_1 = 54.0$      $M_2 = 55.5$      $M_3 = 56.6$,
    $s_1 = 13.9$      $s_2 = 13.5$      $s_3 = 14.4$,
    $r_{12} = .90$     $r_{13} = .88$     $r_{23} = .806$.

Do these data indicate that the tests are parallel?


2. The following table gives means, standard deviations, and correlations for four 
of the subtests of the College Entrance Examination Board Comprehensive Mathe- 
matics Test, Form WCM-1, April 1948. (These data were supplied by Mr. Richard 
Pearson and Dr. Ledyard Tucker of the Educational Testing Service.) 


             | Subtest 3 | Subtest 4 | Subtest 5 | Subtest 6

Subtest 3    |           |   .7350   |   .5983   |   .6203
Subtest 4    |   .7350   |           |   .6049   |   .6357
Subtest 5    |   .5983   |   .6049   |           |   .5515
Subtest 6    |   .6203   |   .6357   |   .5515   |

Means        |  9.0983   |  9.5817   |  3.0417   |  2.4483
Standard
deviations   |  3.9411   |  4.4291   |  1.7569   |  1.9330

(The foregoing data are based on a sample of 600.)

(a) Can these four subtests be regarded as parallel tests?

(b) Would additive adjustments to equate the means be sufficient to make the
tests parallel?

(c) Can subtests 3 and 4 be regarded as parallel tests with respect to means?
With respect to variances and covariances? With respect to all three?

(d) Can subtests 5 and 6 be regarded as parallel tests with respect to means?
With respect to variances and covariances? With respect to all three?


3. The following table gives means, variances, and covariances for various grades
on the College Entrance Examination Board English Composition Test, December
1946. (These data were supplied by Dr. Ledyard Tucker of the Educational Testing
Service.)

          |    A    |    B    |    C    |    D    |    E
    A     | 25.0704 | 12.4363 | 11.7257 | 20.7510 | 20.9125
    B     | 12.4363 | 28.2021 |  9.2281 | 11.9732 | 23.4544
    C     | 11.7257 |  9.2281 | 22.7390 | 12.0692 | 18.0384
    D     | 20.7510 | 11.9732 | 12.0692 | 21.8707 | 19.8371
    E     | 20.9425 | 23.4544 | 18.0384 | 19.8371 | 77.8976

    Means | 14.9048 | 15.4841 | 14.4444 | 14.3810 | 28.0556

(N = 126)

A = reader’s grade on original theme, question 1.

B = a different reader's grade on a hand copy of the original theme, question 1.
(The second reader would not know that the theme had been read before.)

C = carbon copy of B. (With this copy the reader would know that he was reading
a theme already graded by someone else as check on the accuracy of reading.)

D = table leader's check on the grade assigned in A. He might either let the grade
stand, or alter it as a result of his check reading.

E = sum of reader's scores on questions 2 and 3.



(a) Can the four grades assigned to question 1 (A, B, C, and D) be regarded as 
parallel grades? 

(b) Can the three grades that were assigned independently (that is, without knowl- 
edge of previous grades), A, B, and C, be regarded as parallel grades? 

(c) Can A and B be regarded as parallel grades? 

(d) Can C and D be regarded as parallel grades? 

(e) From these results, what conclusions can be drawn regarding the precautions 
necessary in checking on the reliability of reading English themes? 

(f) Can B, C, and D be regarded as parallel tests for predicting a criterion E? 

(g) Can A, B, and C be regarded as parallel tests for predicting criterion E? 


4. Given the following table showing means, standard deviations, and
intercorrelations for a criterion y, and three tests $x_1$, $x_2$, and $x_3$, on a group
of 50 persons, can the three tests be regarded as parallel tests for the purpose of
predicting the criterion y?

                  y      x_1     x_2     x_3

    y           1.00     .52     .56     .53
    x_1          .52    1.00     .94     .91
    x_2          .56     .94    1.00     .89
    x_3          .53     .91     .89    1.00
    Means       45.0    29.0    30.0    31.0
    Standard
    deviations  24.0     9.0     8.0     7.0


Experimental Methods
of Obtaining Test Reliability


1. Introduction 

In previous chapters (see Chapters 2, 3, 6, and 8) reliability was
defined as the “correlation between parallel tests.” In Chapters 2
and 3 a definition of parallel tests was given, in terms of equality of
means, standard deviations, and intercorrelations. Chapter 14 presented
a statistical test for equality of a set of means, a set of variances,
and a set of covariances. In this chapter we shall consider the different
possible ways of obtaining parallel test scores.

The term reliability was introduced by Spearman in his basic papers
on test theory; see Spearman (1904a), (1904b), (1907), (1910), and
(1913). Since then there have been many discussions of the
factors influencing reliability in relation to the different methods of
measuring reliability. For an introduction to these discussions see, for
example, Kelley (1921), Muenzinger (1927), Symonds (1928) me 
(1934), Adams (1936), Kuder and Richardson (1937), Kelley (1942), 
Guttman (1945), Cronbach (1947), and Thorndike (1947). There are
many different ways of classifying the factors influencing reliability and
the methods of measuring reliability. Here we shall consider the
following major methods.

The use of parallel forms.

Retesting with the same test form.

Various split-half methods, such as first versus second halves, odd
versus even items, and the method of matched random subtests
(either halves or thirds).

Recently methods of assessing test reliability or homogeneity have
been devised that do not make use of correlation of parallel scores.
Instead, these methods use item analysis data to assess the homogeneity
of the group of items in the test. One of these methods will be
considered in the next chapter.

Although the error of measurement, discussed in Chapters 2, 3, 4,
and 5, is a more basic concept in test theory than the reliability coeffi- 
cient, it has become customary during the last forty years to assess 
tests in terms of the reliability coefficient rather than in terms of the 
error of measurement. Since there are advantages and disadvantages 
for each of these measures, it is urged here that both must always be given 
in order to make possible a complete assessment of any test. Otis and 
Knollin (1921) pointed out that the error of measurement is superior to 
the reliability coefficient, in that it does not vary with changes in the 
heterogeneity of the group. This property of the error of measurement 
and its effect on the reliability coefficient were discussed in Chapter 10. 
Kelley (1921) and Franzen and Derryberry (1932a) indicated that, 
although the error of measurement did not vary with group heteroge- 
neity, nevertheless the unit in which the error of measurement was ex- 
pressed did vary from one test to another. They suggested several 
ways of overcoming this disadvantage. Lincoln (1932) and (1933) 
pointed out that reliability could be very high even when the differences 
between two sets of measures were very large. This point was also 
amplified and clarified by Ackerson (1933). 

The tests or subtests that are correlated to determine test reliability 
should be parallel both in the sense that they satisfy the statistical 
criteria for parallel tests presented in Chapter 14 and in the sense that 
the items appear to require the same psychological processes and the 
same type of learning on the part of the subjects. This latter criterion 
depends on the judgment of the test technician and the subject matter 
expert, and it will be different for each different type of aptitude and 
achievement test. We shall consider here only general methods of
setting up parallel tests or subtests, which are common to all types of
material.


2. Use of parallel forms 


For most sorts of situations, it will be found that the best method of 
obtaining a test reliability is to construct parallel forms of the test and 
administer them on different days to the same group of subjects. The 
method usually used would be to construct two parallel forms for this 
purpose. However, from the discussion presented in Chapter 14, we 
see that with three parallel forms it is possible to make a more complete 
check and to be certain that the forms are parallel, not only with respect 
to means and variances but also with respect to correlations. 

There is only one situation in which the use of parallel forms admin- 
istered on different days is not advisable. This is when the ability 
that is being tested changes markedly in the interval between the tests.
For example, if we wish to determine the reliability of a typewriting
test by administering one form to a group on Monday and another form 
on Friday, the method would not work if the group was practicing (and 
hence improving rapidly in typewriting ability) during the intervening 
time. Likewise the method is not good if the first test is given when 
the subjects are in excellent “form” and the second test is given when 
the subjects’ ability has decreased, for lack of practice during the
intervening week.

The same sort of consideration applies, for example, to any test of
physical fitness or muscular skill. The two administrations of the test
cannot be used to estimate the reliability of the test if there is good
reason for believing that the subjects have either improved or declined 
in the ability that is being tested. 

For most tests of scholastic achievement and mental ability, it is 
reasonably easy to be sure that the subjects have not actually changed 
markedly during the period intervening between two tests. For other 
types of performance, of which athletic skills of various types are a
good example, it is very difficult to maintain a group at a state of uni- 
form excellence. The skill is likely to deteriorate with lack of practice, 
and may either improve or the person may “go stale” with practice. 
In such cases all the “error of measurement” cannot be attributed to 
the test. Much of what shows up in the statistical check as error of 
measurement is actually true variation in ability. However, from 
another point of view we must perhaps recognize that measurement of 
some skills is extremely unreliable (regardless of the cause of this un- 
reliability); hence in using any such measures we must for many pur- 
poses treat them just as we would treat very unreliable measures. 

However, if we are dealing with a period of time during which the 
ability measured will not change systematically for different members 
of the group, and are dealing with a group of subjects under conditions 
such that it is not likely that the ability will change, the use of different 
forms of the test is the most realistic method of indicating reliability. 

It should be noted that as tests are actually used, if several forms of
a test are available, we are likely to use any of the forms somewhat
indifferently. Likewise, if we are testing a freshman class, the test is
likely to come on different days in different institutions, or in different
years. We can thus see that any form of the test may be given, and it
may be given on any day, so that variability introduced by change of
form and change of day would normally enter into the error of
measurement of a test.

It should also be pointed out that the error possibilities noted above
can be easily detected. If the group has improved or deteriorated, the
mean will be higher or lower the second time. If some persons have
improved and others deteriorated, the standard deviation will in all 
likelihood have changed. A complicated set of influences in which some 
persons improve and others deteriorate in such a way that the mean and 
standard deviation of the group remain the same is a possibility, but 
would doubtless be very rare. 

In summary, the method of testing with parallel forms given several 
days apart is a method that allows the relevant sources of error to influ- 
ence the reliability coefficient. If the statistical tests for equal means 
and standard deviations are used, and satisfied, the method is one that 
may be used routinely with relatively little fear that undetected and 
irrelevant factors are rendering the obtained reliability coefficient either 
spuriously high or spuriously low. 

It should be noted, since speed tests will enter prominently into some 
of the later discussion, that the parallel forms method is valid for speed 
tests. A speed test is a test composed of very easy items—items so 
simple that everyone could answer them if given time. For example, a
set of two-digit additions given to eighth-grade students would approach 
being a “speed test.” If we are to get a good range of scores on such a 
test, it is necessary to have a large number of items, and to set a time 
limit so short that only the best people in the class finish, if at all. In 
such a test, practice effect from one time to the next is important. Un- 
less such conditions as amount of practice and use of “fore exercise" 
were very carefully standardized, it would not be possible to have the 
mean and variance of the parallel forms the same for the group. How- 
ever, if means and variances are the same one can be reasonably certain 
that the intercorrelation between the two parallel forms is a reasonable 
approximation to the reliability coefficient that the test should have. 

A parallel form reliability may also be secured by administering both 
forms at the same session. In some tests there may again be a marked 
difference of performance due to the fact that the giving of the first test
influenced the second test. For example, if it is a speed test of two-digit
additions, it is likely that for many persons, particularly the poorer
ones, the score on the second test will be much better because of the 
practice on the first test. Of course this could easily be detected in the 
results because the mean would be larger on the second form. There 
are also other tests for which the performance on the second form is 
likely to be much worse than the performance on the first form. Any 
test that is fatiguing to the subjects would clearly fall in this category, 
and again such fatigue could easily be detected from the results. The 
average would be lower for the second test than for the first. 

If the foregoing rather obvious and easily detected difficulties were
not present, the major difficulty with reliability obtained by the
successive administration of parallel forms is that it is too high. This is
because there is no possibility for the variation due to normal daily 
variability to lower the correlation between parallel forms. Woodrow 
(1932) in his study of quotidian variability gathered evidence to show 
that there are day-to-day variations in test performance. 

Several other writers have pointed out that sometimes a low correla- 
tion between two parallel forms of a test indicates that the test is an 
unstable measure of a stable trait; at other times such a low correlation
may arise from a stable measurement of an unstable trait. Instability 
in either the test or the trait would result in a low correlation between 
parallel forms. Methods of determining the instability of a trait as 
distinguished from the instability of a test have been suggested by 
Paulsen (1931), Thouless (1936) and (1939), Preston (1940), and Jack- 
son and Ferguson (1941). We can conclude then that, if parallel forms 
of a test are given on the same day and if the statistical criterion for 
parallel forms is satisfied—namely, equal means and standard devia- 
tions—the reliability obtained is likely to be higher than that which 
would be obtained if day-to-day variability had also been allowed to 
affect the reliability.

Generally speaking then, the use of two or three parallel forms admin- 
istered on different days is the best method of determining reliability of 
a test. However, since several parallel forms are frequently not
available, and since it is sometimes difficult to secure cooperation from
subjects for an extended period, we shall consider the possibilities of
obtaining an indication of reliability when only one form of a test is
available.


3. Retesting with the same form 

Sometimes, when two parallel forms of a test are not available, it is 
possible to get an estimate of the reliability by administering the same 
test twice. Usually it is preferable to do this at rather widely separated
times. Again with this method we should watch out for a practice or a
fatigue effect that would be readily detectable in most instances by
observing the distributions of test scores for the first and second
administrations. Aside from such an effect, the major danger in such a
technique is that the reliability will be too high because there will be a
tendency for the subject to duplicate his former performance. That 
is, if the subject does not know the answer to an item, but makes a 
lucky guess and gets it right, he is likely to make the same guess next 
time and again secure credit for the item which he really does not know. 
Likewise, if he makes some minor mistake, and as a result answers
incorrectly an item that he normally would answer correctly, he is more
likely to repeat this performance when the same test is regiven. Such
an effect could not occur if the person were taking a parallel form that
would not contain the same items. In other words, the performance on
a repetition of a test is likely to be much closer to the original score
than the performance on a parallel form of the same test. This method
of repetition of the same test at a different time should in general not 
be used, since it will give a spuriously high coefficient, and the amount 
of error is not easy to determine. 

The major exception is probably some simple perceptual discriminations
for which parallel forms cannot be devised. For example, a test of
pitch discrimination or a test of auditory threshold for different pure
tones can probably be regiven without such an effect. The person simply
judges each time whether he hears a tone or whether he does not
hear a tone. In such a test there does not seem to be a ready way in
which the person could spuriously duplicate his errors and successes of
the previous set of trials. However, even in such simple tasks, it is
frequently desirable to devise several different measuring techniques and
correlate them, as well as to get the reliability for a repeat test by the 
use of each method. In general we may say that, even where it seems 
that a repetition of the same form is all that can be done, it is well for 
the test constructor to use some ingenuity and to get at the given factor 
in several different ways that he believes are roughly comparable, and 
then to see how well the different tests agree. New light will frequently 
be cast on the function being measured in this way. See, for example, 
the tests of auditory discrimination used by Karlin (1942). Studies of 
performance on retesting with the same form have been made by Wood- 
row (1932), Jackson and Ferguson (1941), and Greene (1943). 


4. General considerations in split-half methods

Usually, when only one form of a test is available, reliability is deter- 
mined by a “split-half” method. This means that the items of the one 
form are divided into two forms, each with half the number of items of 
the original form. Typically, the subjects do not know that the test is 
to be scored in two parts, and do not know which items are in which of 
the halves. The experimenter need not, and frequently does not, decide 
how the items are to be divided until he sees the test results. However, 
from the viewpoint of setting up efficient scoring procedures, it is desir- 
able to decide on the division into two subtests before the test is set up 
for printing. 

The methods discussed in previous sections (either the parallel forms
method or retesting with the same form) provided the experimenter
with two scores. In such a case the reliability is given directly by the
Pearson product moment correlation between the two scores. A slightly
modified method is necessary when reliability is to be obtained from two
subtest scores obtained from a single test. One method is to correlate
the two half scores, and then substitute this correlation in the
Spearman-Brown formula for double length (formula 30, Chapter 6). We may
write

(1)  $r'_{xx} = \dfrac{2r_{12}}{1 + r_{12}}$,

where $r'_{xx}$ designates the reliability of the total test as estimated by
correcting the split-half correlation to double length, and
$r_{12}$ designates the correlation between the two halves of the test.

Another method of obtaining the reliability of the total test from
information contained in two subtest scores is to use the formula
presented by Rulon (1939),

(2)  $r''_{xx} = 1 - \dfrac{s_d^2}{s_x^2}$,

where $s_d^2$ is the variance of $x_1 - x_2$, the difference of scores on the two
halves of the test,
$s_x^2$ is the variance of scores on the total test, the sum of the scores
on the two halves of the test ($x = x_1 + x_2$), and
$r''_{xx}$ is used to designate the test reliability as given by equation 2.


Flanagan (19370) has suggested that the use of this formula in conjunc- 
tion with a test-scoring machine provides a rapid and efficient method 
of obtaining test reliability. 

If it is easier to calculate the variance of $x_1$ and the variance of $x_2$
than it is to calculate the variance of the difference ($x_1 - x_2$), $r''_{xx}$ may
be written as

(3)  $r''_{xx} = 2\left(1 - \dfrac{s_1^2 + s_2^2}{s_x^2}\right)$,

where $s_1^2$ is the variance of the $x_1$ subtest scores,
$s_2^2$ is the variance of the $x_2$ subtest scores, and the other terms
are as defined in equation 2.

Guttman (1945) derived this equation as a lower bound ($L_4$). He
showed that, under the assumption that $s_1 = s_2$, this formula is
identical with equation 1. Guttman also points out that, since this formula
gives a lower bound, it may be that in some cases the reliability
coefficient of a test has been underestimated, and that this fact may explain
why correlations corrected for attenuation are sometimes above unity.

In order to prove that equation 2 is equal to equation 3 we note that
the variance of a sum is equal to $s_1^2 + s_2^2 + 2r_{12}s_1s_2$, and that the
variance of a difference is equal to $s_1^2 + s_2^2 - 2r_{12}s_1s_2$. Substituting
these expressions in equation 2 gives

(4)  $r''_{xx} = 1 - \dfrac{s_1^2 + s_2^2 - 2r_{12}s_1s_2}{s_1^2 + s_2^2 + 2r_{12}s_1s_2}$.

Putting this over a least common denominator and simplifying, we have

(5)  $r''_{xx} = \dfrac{4r_{12}s_1s_2}{s_1^2 + s_2^2 + 2r_{12}s_1s_2}$.

If we put equation 3 over a common denominator in the same way we
obtain

(6)  $r''_{xx} = 2\left[\dfrac{2r_{12}s_1s_2}{s_x^2}\right]$,

which is identical with equation 5 derived from equation 2; hence the
two formulas (2 and 3) for $r''_{xx}$ are identical.

Rulon (1939) has shown that $r'_{xx}$ given by equation 1 and $r''_{xx}$ given
by equations 2 or 3 are identical if the standard deviation of subtest 1
is equal to the standard deviation of subtest 2. It may also be shown
that, whenever $s_1 \ne s_2$, $r''_{xx} < r'_{xx}$. If we divide the numerator and
denominator of equation 4 by $s_2^2$, and write h for the ratio $s_1/s_2$, we
have

(7)  $r''_{xx} = 1 - \dfrac{h^2 + 1 - 2r_{12}h}{h^2 + 1 + 2r_{12}h}$.

By taking the derivative of $r''_{xx}$ with respect to h, and setting it equal
to zero, we can show that, for all positive values of h (and a positive
correlation $r_{12}$), $r''_{xx}$ is a maximum if h = 1. For students not acquainted
with the calculus, we may indicate the condition for a maximum value of
$r''_{xx}$ by the following algebraic transformations. If we put equation 7
over a common denominator, simplify, and divide the numerator and
denominator by h, we obtain

(8)  $r''_{xx} = \dfrac{4r_{12}}{h + \dfrac{1}{h} + 2r_{12}}$.

By adding and subtracting 2 in the denominator, we may write

(9)  $r''_{xx} = \dfrac{4r_{12}}{\dfrac{(h - 1)^2}{h} + 2 + 2r_{12}}$.

We can see by inspection that, if h = 1, $r''_{xx} = r'_{xx}$; and, if h has any
positive value other than unity, $r''_{xx} < r'_{xx}$. Since the ratio of two
standard deviations is always positive, it follows that $r'_{xx} \ge r''_{xx}$.

The corrected split-half correlation (as indicated in equation
1) is identical with the reliability as computed by equations
2 or 3 if the variances of the two halves are equal. If the
variances of the two halves are unequal, the corrected
split-half estimate of reliability will be larger than the value
given by equations 2 or 3.
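
A short numerical sketch may make the role of the ratio h concrete. It
evaluates equation 9 for several values of h and compares the result with
the Spearman-Brown value for double length, assumed here to be the
standard form 2r/(1 + r) of equation 1. The function names and the value
of the half-test correlation are illustrative assumptions.

```python
def spearman_brown_double(r12):
    """Split-half correlation stepped up to double length: 2r / (1 + r)."""
    return 2 * r12 / (1 + r12)


def reliability_eq9(r12, h):
    """Equation 9, with h the ratio of the two half-test standard deviations."""
    return 4 * r12 / ((h - 1) ** 2 / h + 2 + 2 * r12)


r12 = 0.60  # hypothetical correlation between the two halves
for h in (0.5, 0.8, 1.0, 1.25, 2.0):
    print(h, round(reliability_eq9(r12, h), 4),
          round(spearman_brown_double(r12), 4))
```

Because the term (h - 1)^2 / h is unchanged when h is replaced by 1/h, it
does not matter which half is taken as the more variable one.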


It should also be pointed out that the statistical tests of Chapter 14 
make it desirable not to use the correlation between two subtest scores 
for the estimation of reliability, but to divide the total test into three or 
possibly four parts, and to test the similarity of these parts as well as 
to obtain the correlation between them. These correlations can then 
be used in the generalized Spearman-Brown formula (see equation 10,
Chapter 8), with K set equal to three or to four and the reliability of 
the total test estimated. By using this method we know that we are 
using a correlation between parallel subtests as the basis for obtaining 
reliability. This means that the reliability found will not be too low 
because non-parallel subtests were chosen as the basis for estimating 
reliability. It is interesting to note that the use of more than two sub- 
tests in determining reliability has been suggested by Cureton (1931), 
Dunlap (1933), and Stephenson (1934).
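
The procedure for three or four parts can be sketched as follows. The
sketch assumes the standard form of the generalized Spearman-Brown
formula, Kr / (1 + (K - 1)r), and, as a simplification introduced here,
uses the mean of the correlations among the parts as r. The function
names and the scores are hypothetical, and the statistical test of
Chapter 14 for the similarity of the parts is not included.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, pvariance


def pearson_r(x, y):
    """Product-moment correlation of two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / sqrt(pvariance(x) * pvariance(y))


def spearman_brown(r, k):
    """Reliability of a test K times as long as a part with reliability r."""
    return k * r / (1 + (k - 1) * r)


def reliability_from_parts(parts):
    """Step up the mean correlation among K parallel parts to full length."""
    correlations = [pearson_r(a, b) for a, b in combinations(parts, 2)]
    return spearman_brown(mean(correlations), len(parts))


# Hypothetical scores of six subjects on three parallel subtests.
part_a = [4, 6, 5, 8, 7, 3]
part_b = [5, 6, 4, 9, 7, 2]
part_c = [4, 7, 5, 8, 6, 3]
print(round(reliability_from_parts([part_a, part_b, part_c]), 4))
```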

The major problem in using subtest scores for the purpose of estimat- 
ing reliability is dividing the original test into equivalent subtests. We 
shall next consider some of the methods of dividing a test into subtests, 
and the advantages and disadvantages of each. 


5. Successive halves or thirds 

Dividing a test into comparable halves or thirds is not a simple mat- 
ter. For example, the easiest way to divide the test is to take the first 
half of the test against the second half of the test. Often such a method 
will not result in parallel tests at all. For example, if the test is given in 
one session and is a timed test, any items that are not answered for lack 
of time will be in the second half of the test. The score on the second 
half will be lower than on the first half. For a speed test composed of
easy items the results of plotting score on first half against score on
second half are very peculiar. All subjects who did not reach the second
half would score zero on it, regardless of what their score was on the first
half. If the test is a pure speed test, in the sense that the vast majority
of subjects get the item correct if they try it, so that the only errors are
“items not yet attempted,” everyone who finishes the first half gets a
perfect or near-perfect score on it, regardless of his score on the second
half. Figure 1 is such a scatter diagram. Clearly any correlation worked
on such a diagram could not be interpreted as a reliability coefficient.
Probably such a pure case will rarely be found. But wherever the score
is in large part determined by the fact that time is called before many
subjects have finished, this situation will be approximated, and the
first versus the second half will not be “comparable halves” suitable for
obtaining an estimate of the reliability coefficient.


Figure 1. Showing the relationship between scores on the first and last
halves of a test for a pure speed test. (Abscissa: score on first half of
test; ordinate: score on second half of test.)


It might be thought that, if all subjects finished two-thirds of the
test, we could correlate the first third with the second third of the test,
and correct this coefficient to triple length. However, such a method
is valid only if the last third is parallel to the two matching halves
secured from the first two-thirds. If the difficult items are at the end
of the test, it is impossible to make any plausible guesses regarding what
would happen if the time limit were increased so that everyone could
finish the test. Furthermore, such a method does not give the reliability
of the test with the shorter time limit. It estimates what the reliability
might be if the time limit were such that practically everyone finished
the test. If the time limit is important, we must use a parallel form
method of estimating reliability. If the time limit is generous so that
most subjects finish the test, it may be possible to estimate reliability
from subtest scores.

In addition to the problem of time limits on a test, the problem of 
item difficulty must also be considered. Many tests are constructed 
with the easy items first, the items of average difficulty next, and the 
most difficult items at the end of the test. Clearly, if the items in the 
test are in difficulty order, the first and second halves will not be com- 
parable halves. 

It can be seen that, if a test contains a number of items of average 
difficulty, and is then lengthened by adding more very difficult items, 
the reliability of the test will decrease, despite increased test length and 
the increased testing time. The new added items will be answered on 
a chance basis by most of the persons; hence it will be a matter of acci- 
dent whether they get the new items right or wrong. As a larger num- 
ber of very difficult items are added, a larger component of the score 
will be due to guessing, and this component will decrease the reliability 
of the score on the augmented test. This in no way contradicts the 
Spearman-Brown formulation on the relation of test length to test re- 
liability, since in this formulation it was assumed that the new set of 
items was parallel to the old ones. This means that the items were of 
similar mean, standard deviation, and reliability. The new items sup- 
posedly added here would be difficult items with a lower mean; and, 
since they would be answered on the basis of chance, the reliability of 
this new portion and its correlation with the easier part of the test would 
each be near zero. 
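
The argument can be made concrete with a short calculation. The sketch
below assumes, as in the paragraph above, that the added difficult items
contribute a score component that is pure chance: it is uncorrelated with
the original score and does not repeat from form to form. Under that
assumption the true variance is unchanged while the total variance grows.
The function name and the numbers are illustrative only.

```python
def augmented_reliability(r_xx, var_x, var_chance):
    """Reliability of (original score + pure chance component).

    r_xx       : reliability of the original test
    var_x      : variance of the original test scores
    var_chance : variance contributed by the chance-answered items
    """
    true_variance = r_xx * var_x          # unchanged by the added items
    total_variance = var_x + var_chance   # chance part is uncorrelated
    return true_variance / total_variance


# Hypothetical test with reliability .90 and score variance 100.
for var_chance in (0, 10, 25, 50):
    print(var_chance, round(augmented_reliability(0.90, 100, var_chance), 3))
```

Each increase in the chance variance lowers the reliability of the
lengthened test, although the test has more items.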

From considerations such as these, we see that the effect of increasing 
the time limit on a test is difficult to predict. Increasing the time limit 
will permit subjects to answer more items; hence it may be thought of 
as increasing the effective length of a test. However, many of the sub- 
jects will not know the answers to the more difficult items at the end of 
the test; hence they will guess about these items and add a chance in- 
crement to their score. This increment will not remain stable from 
form to form; hence it will lower the reliability of the test. 

If we wish to use the first and second halves (or the successive thirds) 
of a test for computing reliability, it is possible to plan the test to over- 
come both the problems raised by time limits and by the difficulty 
ordering of items. For the first versus the second halves method, for 
example, we arrange the test items so that the item difficulty range in 
the first half of the test is duplicated in the second half of the test.
Then, if sufficient time is given so that everyone, or practically everyone,
has a chance to finish the test, the first and second halves will be
comparable unless there is either a practice or a fatigue effect as a subject
goes through the test. If a test is given in two sessions, with time
out between for rest and relaxation, if item difficulty is equated between 
the two sessions, and comparable time allowances are given for each 
session, it is probable that a good estimate of reliability can be obtained 
by correlating the first session with the second. 

For example, the University of Chicago comprehensive examinations 
are six-hour examinations given in two three-hour sessions, with two 
hours elapsing between sessions. It was found feasible to construct the 
examinations so that the subjects or topics covered in the morning ses- 
sion were dealt with again in the afternoon session, and consequently 
the morning and afternoon sessions were roughly comparable in length, 
in item difficulty (as judged by the committee constructing the examina- 
tion), and in topics covered. When the examinations were set up in this 
way, the correlation between morning and afternoon sessions was found 
to be about the same as for other selections of comparable halves.

Likewise, for radio code receiving tests, use has been made of a corre- 
lation between errors on first versus second halves of the test. In this 
case the material received by the students is of about comparable
difficulty all the way through, since the same characters (letters of the
alphabet) are used, and they are sent at the same speed throughout the
test. Also there is no question of a differential time limit. All subjects
must keep up with the rate of sending, or skip characters and start
afresh at all times. Furthermore, these particular tests are short, only
about three minutes in length, so that there is relatively little
opportunity for a fatigue effect. The tests are preceded by about ten minutes
of warming-up practice by the students so that there is probably little
consistent improvement from the early to the later parts of the test.
It should be noted that for these radio code receiving tests the best
method of testing reliability is provided by a parallel form, but it should 
be given very shortly after the first form in order to avoid the effects of 
practice. The method of using errors in odd versus even words would in 
this case result in a spuriously high reliability, since the making of an 
error in one word is likely to throw the student off a little, and he is 
also likely to miss the next word or two before he gets “back into stride” 
again. Thus it is likely that the correlation between errors made on 
odd- versus even-numbered words would be considerably higher (and 
spuriously higher) than the correlation between parallel forms of the 
test, administered under comparable conditions with a relatively brief 
period between the two tests. Also each test should be given with
suitable provision for a warming-up period that is not counted on the
test score. 


6. Odd versus even items or every nth item 

By far the commonest form of comparable halves is the odd-even 
items division. It is probable that this method never gives too low a 
value for the reliability coefficient. If there is an error it is always in 
the direction of a reliability that is spuriously high. Sometimes, as will 
be shown, the odd-even reliability seriously overestimates the test re- 
liability as indicated by parallel forms. 

It can readily be seen that, if the items are in difficulty order, the odd 
items will have about the same average difficulty and spread of difficulty 
as the even items. If there is any bias, it is likely to be that the odd 
items will be on the average very slightly easier than the even items. 

In using this method, however, we must be certain that there is no 
dependence of one item on another. For example, in the radio code 
tests just mentioned, success or failure on one item, particularly
failure, is very likely to influence the performance on the next item. In
some tests we find a series of questions on a given topic, and it is some- 
times difficult to decide whether the items are independent, in the sense 
that knowing the answer depends primarily upon whether or not we 
have studied the topic or whether there is a spurious dependence, as in 
the case of errors in radio code. In performance tests, where the subject 
is to assemble or disassemble a mechanism, and is graded on the various
steps, there is very likely to be a spurious relationship, in the sense that 
the subject learns or does not learn a certain set of acts as a unit while 
the examiner, in order to grade the performance objectively, sets up 
numerous rather artificial divisions. In such cases as these, it seems 
that the fair test to apply is: “Would you as an examination constructor 
set up such halves as separate tests?” In performance on assembly of 
apparatus, it is doubtful if the test constructor would want the students 
to go through the entire performance, as would be necessary, and grade 
them on only half the points that it was possible to observe. In a set 
of statements describing the characteristics of rods and cones in the 
eye, for example, it is rather likely that the test constructor would 
assent to using half the statements for a shortened form of the test. It 
might readily be, however, that the odd items would not constitute a 
satisfactory parallel form for the even items. The items should be in- 
spected to insure that the type of subject matter and the difficulty dis- 
tribution for one of the halves are roughly paralleled in the other half. 

Odd-even correlation is also spuriously high on a test with a rather
stringent time limit so that a large number of subjects do not finish the
test. If a subject fails to answer the last ten items in the test, obviously
he “misses” all of them. Thus he gets five points more on his “odds”
error score, and likewise five points more on his “even” error score. It
is highly probable that careful observation will show that many of the
published reliabilities are spuriously high because of this factor. Again
this type of error can be strikingly illustrated in the “pure speed” test
to which previous reference has been made. If every subject gets all
items correct as far as he goes, one who finishes ten items will have an
odds score of five and an evens score of five. If he finishes eleven items,
he will have an evens score of five and an odds score of six, and then at
twelve items the score will be six and six. That is, the odds and evens
scores will either be identical, or the odds score will be one point greater
than the evens score. The correlation scatter plot will appear as in
Figure 2, and the correlation will be well over .99. Again such a pure
case probably never occurs, but an approximation to it (coupled with a
spuriously high reliability) occurs whenever the odd-even method is
used, and not all the subjects finish the test.
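
The pure speed case is easily reproduced. The sketch below assumes a
hypothetical group in which one subject stops after each number of items
from 10 to 50 and every attempted item is correct; the function names are
introduced here for illustration.

```python
from math import sqrt


def odd_even_scores(items_finished):
    """Odd and even scores when every attempted item is answered correctly."""
    odd = (items_finished + 1) // 2   # items 1, 3, 5, ...
    even = items_finished // 2        # items 2, 4, 6, ...
    return odd, even


def pearson_r(x, y):
    """Product-moment correlation of two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical group: one subject at each stopping point from 10 to 50 items.
finished = list(range(10, 51))
odds, evens = zip(*(odd_even_scores(n) for n in finished))
r = pearson_r(odds, evens)
print(round(r, 4))
```

The correlation is produced entirely by the point at which each subject
stops, so it says nothing about how consistently a subject would reach
that point on a second administration.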


Figure 2. Showing the plot of odd versus even items for a pure speed test.
(Abscissa: odd-half score; ordinate: even-half score.)


It should be noted that the odd-even reliability is probably still too 
high, even when the items are in order of difficulty, and all persons are
allowed to finish the test, and the items are independent of one another
(in the sense that making a mistake on one item does not of itself increase
the probability of making a mistake on another item). Any
variability due to day-to-day variations in ability is ruled out, and even 
the variation that might be caused by a slight practice or fatigue effect 
as we progress through the test is ruled out. If we use the parallel 
forms as a standard, the odd-even reliability, as generally applied to 
most tests, probably gives a result that is too high, owing to the careful 
control of various other sources of variation and also to the fact that 
most tests are timed tests having a fair proportion of the score depend- 
ent upon most of the subjects not getting a chance to try the last items. 
To the extent that a test is a speed test, the score depending on how 
rapidly a subject works in the given time limit, there is no way of esti- 
mating the reliability except by testing a second time with a parallel 
form. 

There have been several studies comparing the odd-even with parallel 
form reliability and with the correlation between test and retest 
(with the same test). See Foran (1931), Jordan (1935), Goodenough 
(1936), Remmers and Whisler (1938), Ferguson (19410), Greene (1943), 
and Jackson and Ferguson (1941). These experiments show clearly 
that different methods of measuring reliability give different results. In 
general, the parallel form reliability is lowest, and the odd-even (cor- 
rected) is the highest. 

It might be thought that, if everyone finished two-thirds of the test, 
we could use an odd-even reliability on the first two-thirds, get the 
correlation between these two thirds, and then correct to triple length. 
However, this gives an estimate of the reliability of the total test on 
the assumption that everyone finishes the test. It does not give any esti- 
mate of the extent to which a given subject will hit the same speed rate 
on different administrations of the test; and hence will get to the same 
point in the test. There is no possible way to estimate this factor accu- 
rately except by giving parallel forms with comparable time limits and 
under standard directions, and then observing the extent to which the
score is the same.


7. Matched random subtests 

If a single test score for each subject is to be used in estimating test 
reliability, it is necessary to regard this single score as divided into two, 
three, or four equivalent subtest scores. In the preceding sections we 
have seen that under certain conditions successive halves or thirds of a 
test can reasonably be regarded as parallel forms, whereas under other 
conditions the successive segments of a test are clearly not parallel
forms. Similarly, assigning every second or every third item to one of
two or three subtests is sometimes a good method for obtaining parallel 
subtests, and sometimes a poor method. 

If a test is composed of a large number of independent items and is 
administered with a liberal time limit, it can usually be divided into 
parallel subtests. If the test has only a few item groups in it, as, for 
example, in most mechanical assembly tests or in tests involving the 
writing of a paragraph in English, it may or may not be possible to de- 
vise a test that is composed of parallel subtests. If a single time limit 
that is short is used, there is no possibility at present of getting any 
valid estimate of the reliability of such a test by using subtest scores of 
any sort. 

If item analysis data are available on a test (that has a large number 
of independent items and a liberal time limit), items may be matched 
on such item analysis data and assigned to the subtests. This is an 
excellent method of insuring that the subtests will be parallel. For 
example, suppose that the percentage of persons answering the item 
correctly (p) and a biserial or point biserial correlation with total test 
score (r) are available for each item. The best procedure for construct- 
ing parallel subtests is to represent each item by a point on a scatter 
diagram, the abscissa of which represents p and the ordinate r. In
order subsequently to identify the items, each point should be numbered
with the item number, as shown in Figure 3. Items may then be
simultaneously matched on p and r, and a ring drawn around the
matched pairs, triples, or quadruples, as shown in the diagram. If the
test is heterogeneous with respect to item type or with respect to type
of subject matter covered, it is important to match items for subject
matter, item type, etc., as well as for p and r.

One member of each group should then be randomly assigned to a 
given subtest. For example, if only two subtests are to be formed, the 
assignment could be made by tossing a coin, and assigning the lower 
numbered item of the pair to form A if the coin showed heads and to 
form B for tails. In constructing three parallel subtests, it is necessary 
to assign each triple of items to the three parallel subtests by a some- 
what more complicated procedure. For example, the items in each 
triple may be identified by item number, as low, medium, and high 
(L, M, and H). There are then six possible ways of assigning these 
three items, one to each of three subtests. Each such order may then 
be assigned a number from 1 to 6 (1 = LMH, 2 = LHM, etc.), and each
triple assigned according to the throw of a die.
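
The random assignment can be sketched in a few lines. The sketch assumes
that the matched pairs or triples have already been formed from the
scatter diagram; a pseudo-random shuffle stands in for the coin or the
die, and the function name and item numbers are hypothetical.

```python
import random


def assign_matched_groups(groups, seed=None):
    """Assign one item from each matched group to each of K subtests.

    groups : list of tuples of item numbers, each tuple holding K items
             matched on p and r (and on content, where relevant)
    seed   : optional seed so that an assignment can be reproduced
    """
    rng = random.Random(seed)
    k = len(groups[0])
    subtests = [[] for _ in range(k)]
    for group in groups:
        if len(group) != k:
            raise ValueError("every matched group must contain K items")
        order = sorted(group)   # low, medium, high item number
        rng.shuffle(order)      # one of the K! possible orders at random
        for subtest, item in zip(subtests, order):
            subtest.append(item)
    return subtests


# Hypothetical matched triples of item numbers.
triples = [(3, 17, 22), (5, 8, 30), (1, 12, 26)]
form_a, form_b, form_c = assign_matched_groups(triples, seed=1)
print(form_a, form_b, form_c)
```

With pairs instead of triples the same routine reproduces the coin toss,
since each pair then has only two possible orders.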

If such item analysis information is available before the test is
assembled, scoring routines are much simplified if the items of one subtest
are put first, another second, etc., or else if items from the different
subtests are distributed successively through the test. For example, for
three subtests, A, B, and C, the items might be arranged ABCABC, etc.


Figure 3. Showing how to construct three parallel subtests or tests by
simultaneously matching items on a difficulty and reliability index.
(Abscissa: proportion passing item, p; ordinate: item-test correlation, r.)


It must again be emphasized that no matter which order of items is 
used, it is necessary to allow time for almost all students to finish almost 
all items. It is not possible to estimate test reliability from parallel 
subtests if the test score is markedly influenced by the time limit. 

An analogous method may also be used in attempting to build a 
second test to match a first one already in use. Figure 4 illustrates the 
use of such a procedure in developing an aptitude test for the U. S. Navy.
In this case the items for form 1 were already in use when form 2 was 
constructed, so that the two forms could not be matched as well as if it 
had been possible to set up two forms simultaneously.


Figure 4. Selecting items for a second form of a test to match …
selected for a first form. Item analysis data used in the sel… GCT
Form 2 Analogies. [From Satter (1944), OSRD Re…, …chology Panel,
NDRC.] (Abscissa: p, proportion passing item; ordinate: r, relationship
between answer to test item and total score. Legend: Form 1 items;
Form 2 selected items; Form 2 discarded items.)


In conclusion, it should be pointed out that a statistical criterion for
parallel tests is now available. We should use it on the subtest scores to
find out whether or not the precautions used in constructing the subtests
resulted in parallel sets of scores. In order to make complete use
of the methods of Chapter 14 in testing covariances as well as variances,
it is necessary to have three subtest scores.


8. Reliability of essay examinations

In dealing with the reliability of essay examinations, we encounter 
certain special considerations that are not involved when determining 
the reliability of objective examinations. 

One major problem in essay examinations is the accuracy with which 
the same (or different) readers will grade the examinations. The usual 
method of checking on the accuracy of reading is to have two or three 
different readers assign a mark to the examination independently of each 
other. This means that the readers will agree beforehand on the different
marks to be assigned, and on the type of papers to which each
mark is to be given, and then each reader will record his marks on some 
sort of master list of students. It is necessary to be certain that one 
reader does not see any of the marks assigned by another reader, and 
that an earlier reader does not make any marks on the paper, since these 
could be seen by and perhaps influence a later reader. 

These marks, independently assigned to each of a set of papers by
different readers, are then correlated. This correlation between the
marks of two readers is known as “the reader reliability” of the
examination. It should be noted that the correlation between marks of two
readers may be high, even though the means and standard deviations of
the marks may be radically different. If the mean of reader A is higher
than the mean of reader B for a given set of papers, it indicates that A
is an “easier grader” than B. Such a difference can be adequately
taken care of by adding a constant to each grade assigned by B, or
subtracting a constant from each grade assigned by reader A.
Correspondingly, if reader A has a larger standard deviation for his marks on a set
of papers than reader B, this can be corrected by multiplying B's marks 
by an appropriate constant, or dividing A's marks by some constant. 
A difference in mean and standard deviation is not serious provided the 
correlation between the marks is high. That is, the papers need not be 
regraded by the readers, but it is essential, in order to have a fair mark- 
ing system, that the marks of the different readers be equated in mean 
and standard deviation before being used further. If the reader reli- 
ability is low, there is no way of equating marks; it is necessary for the 
readers to discuss their differences of opinion and to regrade the papers 
before the marks can be used. 
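
The equating of marks described above amounts to a linear
transformation. The following is a minimal sketch, assuming that reader
B's marks are to be expressed on reader A's scale and that both readers
marked the same set of papers; the function name and the marks are
hypothetical.

```python
from statistics import mean, pstdev


def equate_marks(marks_b, marks_a):
    """Rescale reader B's marks to the mean and standard deviation of reader A."""
    mean_a, sd_a = mean(marks_a), pstdev(marks_a)
    mean_b, sd_b = mean(marks_b), pstdev(marks_b)
    if sd_b == 0:
        raise ValueError("reader B's marks have no spread and cannot be rescaled")
    scale = sd_a / sd_b  # the multiplying constant
    # Multiply deviations by the constant, then shift to reader A's mean.
    return [mean_a + scale * (m - mean_b) for m in marks_b]


# Hypothetical marks given by two readers to the same five papers.
reader_a = [70, 82, 91, 65, 77]
reader_b = [60, 68, 75, 58, 66]
adjusted_b = equate_marks(reader_b, reader_a)
print([round(m, 1) for m in adjusted_b])
```

Since the transformation is linear with a positive multiplier, it leaves
the correlation between the two readers unchanged; it cannot repair a low
reader reliability.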

A more precise method of handling the problem of comparability 
among readers is to use the methods of Chapter 14 to analyze the results 
of several different readers. The test with $L_{mvc}$ will indicate
immediately whether or not the means, variances, and correlations among the
set of readers may be regarded as identical. If $L_{mvc}$ is near unity,
we have only to inspect the magnitude of the correlations to see if they
are high enough to be satisfactory. In general, we should strive for a
reader reliability over .90 if it is possible to achieve this level. It would 
seem that a reader reliability of less than .80 is so low as to necessitate 
further discussion and alteration in methods of reading. It should be 
remembered that the reader reliability is an upper bound for the test 
reliability. If two readers looking at the same paper agree to the extent 
of .80, for example, then if different questions (parallel questions) on 
the same material were read by different readers, the agreement is prac- 
tically certain to be much less than .80. 

In order to make clear the method of comparing reliability of essay 
examinations with that of objective examinations, we shall consider 
what has been termed “content reliability” (see Gulliksen, 1936). In 
an objective examination scored with a key that has previously been 
agreed on by all persons concerned, the equivalent of “reader reliability” 
is unity. Any difference in scores between parallel tests is due to dif- 
ferences in sampling of subject matter content in the two tests, and to 
possible changes in the subject between the time of administration of 
the two tests. If two parallel essay examinations are matched just as 
successfully with respect to content as are two parallel objective exam- 
inations, the correlation between the two parallel essay forms will 
practically always be lower than between the two objective forms, 
owing to the fact that the unreliability of reading will still further lower 
the correlation between the two essay forms. In order to determine the 
extent to which the low reliability of an essay examination is due to 
poor agreement among readers or to poor matching of questions in
parallel forms, it is necessary to determine the content reliability of the
essay examination.


For one form of an examination, let us use:

$x'_1$ to indicate the score assigned by reader 1,

$x'_2$ to indicate the score assigned by reader 2, and

$x'_c$ to designate the correct score that the student should have
received on the content of his paper, if it had not been for reader
errors. It should be noted that $x'_c$ is not the “true score.” It
is comparable to the score on an objective examination and like
such a score has a true component, and an error due to the
unreliability of sampling, unreliability of student performance, etc.

$e'_1$ designates the error made by reader 1, and

$e'_2$ designates the error made by reader 2.

For the parallel form we shall use $x''_1$, $x''_2$, $x''_c$, $e''_1$, and $e''_2$, all
defined as above, but for the second form of the test. It is assumed that

$$x'_1 = x'_c + e'_1, \tag{10}$$

$$x'_2 = x'_c + e'_2, \tag{11}$$

$$x''_1 = x''_c + e''_1, \tag{12}$$

and

$$x''_2 = x''_c + e''_2. \tag{13}$$


It is possible to compute the correlations $r_{x'_1x'_2}$ and $r_{x''_1x''_2}$, which are
reader reliabilities, and also to compute the four correlations of the
form $r_{x'_ax''_b}$, where $a$ = 1 or 2 and $b$ = 1 or 2. These are test reliabilities,
attenuated by the inaccuracy of reading. The problem is to express
$r_{x'_cx''_c}$ (the content reliability of the test) as some function of the known
correlations.

First we may obtain the relationship of the reader reliability to the
variance of $x'_1$ and $x'_c$.

$$r_{x'_1x'_2} = \frac{\Sigma x'_1 x'_2}{N s_{x'_1} s_{x'_2}}. \tag{14}$$

If we assume that the two readers are reading with equal accuracy, we
may substitute $s_{x'_1}$ for $s_{x'_2}$. Also for $x'_1$ and $x'_2$ let us substitute their
values as given in equations 10 and 11, obtaining

$$r_{x'_1x'_2} = \frac{\Sigma (x'_c + e'_1)(x'_c + e'_2)}{N s^2_{x'_1}}. \tag{15}$$

If we expand the numerator and assume that the correlations $r_{e'_1e'_2}$,
$r_{e'_1x'_c}$, and $r_{e'_2x'_c}$ are equal to zero, we have

$$r_{x'_1x'_2} = \frac{\Sigma (x'_c)^2}{N s^2_{x'_1}}. \tag{16}$$

If we write $s^2_{x'_c}$ for $\Sigma (x'_c)^2/N$ and take the square root, we have

$$\sqrt{r_{x'_1x'_2}} = \frac{s_{x'_c}}{s_{x'_1}}. \tag{17}$$


By a similar procedure for the other test, and the other reader, we have

$$\sqrt{r_{x''_1x''_2}} = \frac{s_{x''_c}}{s_{x''_2}}. \tag{18}$$


The correlation between the two forms is $r_{x'_ax''_b}$. There are four such
correlations for different values of $a$ and $b$. Let us assume that all may
be regarded as equal, that is, that the tests are parallel and that the
error variance due to reader is equal for each reader and each form; then
one of the four correlations, say $r_{x'_1x''_2}$, may be taken as typical of the
group. We have then

$$r_{x'_1x''_2} = \frac{\Sigma x'_1 x''_2}{N s_{x'_1} s_{x''_2}}. \tag{19}$$

Substituting equations 10 and 13 in the numerator gives

$$r_{x'_1x''_2} = \frac{\Sigma (x'_c + e'_1)(x''_c + e''_2)}{N s_{x'_1} s_{x''_2}}. \tag{20}$$

Expanding the numerator and noting that the correlations of reader
errors with each other and with test score ($x_c$) are zero, we have

$$r_{x'_1x''_2} = \frac{\Sigma x'_c x''_c}{N s_{x'_1} s_{x''_2}}. \tag{21}$$

By the usual definition for correlation this becomes

$$r_{x'_1x''_2} = \frac{r_{x'_cx''_c}\, s_{x'_c} s_{x''_c}}{s_{x'_1} s_{x''_2}}. \tag{22}$$

Substituting from equations 17 and 18, we have

$$r_{x'_1x''_2} = r_{x'_cx''_c} \sqrt{r_{x'_1x'_2}} \sqrt{r_{x''_1x''_2}}. \tag{23}$$

Solving for $r_{x'_cx''_c}$, the content reliability, we have

$$r_{x'_cx''_c} = \frac{r_{x'_1x''_2}}{\sqrt{r_{x'_1x'_2}\, r_{x''_1x''_2}}}. \tag{24}$$


It will be noted that equation 24 is identical in form with the
correction for attenuation, equation 21, Chapter 9. It gives the correction for
the “attenuation due to inaccuracy of reading.”


The reliability of an essay test corrected for attenuation due
to the inaccuracy of reading has been termed the content
reliability of the essay test. The content reliability is equal to
the correlation between parallel forms divided by the geometric
mean of the reader reliabilities of the two forms. (See
equation 24.)
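
Equation 24 is simple to apply once the three correlations are known. The
following is a minimal sketch; the function name, argument names, and the
numerical values are assumptions introduced here for illustration.

```python
from math import sqrt


def content_reliability(r_forms, r_readers_form_1, r_readers_form_2):
    """Equation 24: parallel-form correlation corrected for reader error.

    r_forms          : correlation between marks on the two parallel forms
    r_readers_form_1 : reader reliability on the first form
    r_readers_form_2 : reader reliability on the second form
    """
    if r_readers_form_1 <= 0 or r_readers_form_2 <= 0:
        raise ValueError("reader reliabilities must be positive")
    # Divide by the geometric mean of the two reader reliabilities.
    return r_forms / sqrt(r_readers_form_1 * r_readers_form_2)


# Hypothetical values: forms correlate .60, readers agree .80 on each form.
print(round(content_reliability(0.60, 0.80, 0.80), 2))
```

As with any correction for attenuation, a low reader reliability in the
denominator can push the corrected value above unity, so the result
should be read with the accuracy of the three correlations in mind.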


9. Summary 


Three main methods of determining reliability have been considered. 
1. Parallel forms. Generally speaking, this method is best, provided 
that we can regulate the interval between the two tests and the activity 
of the subjects during that interval so that the influence of practice,
fatigue, and other similar effects will be negligible. If three parallel 
forms are used, and the statistical criterion for parallel tests given in 
Chapter 14 is applied, score changes due to practice, fatigue, etc., are 
detected immediately and routinely. 

It should be especially noted that, if the score variance depends in 
any large part on unanswered items at the end of the test (for example, 
if speed is an important factor in test score), it is necessary to use the 
parallel forms method. Neither of the other two methods is satisfactory 
in this case. 

2. Retesting with the same test. This can be done particularly well 
in such tests as sensory limen or discrimination tests, in which it is not 
likely that the subject will remember and recognize the individual 
items. As with parallel forms, it is necessary for the experimenter to 
control both the length of time between tests and the activity of the 
subjects during the interval so as to rule out practice, fatigue, and sim- 
ilar effects. If the test has a distinct speed component, it is very un- 
likely that the same form can be repeated with no score change such as 
those produced by either practice or fatigue. However, if the statistical 
criterion for parallel tests is routinely applied, such score changes are 
detected immediately. 

3. Some variant of the split-half or parallel subtests method. If only 
one form of the test is available, and it is not possible or desirable to 
repeat the test with the same group of subjects, it is possible to consider 
using one of these methods. Such methods cannot be used unless the 
test has a liberal time limit. It is also desirable, but not always essen- 
tial, that the test have a large number of independent items. If three 
or more parallel subtests are used, then again the criteria presented in 
Chapter 14 will show whether or not parallel subtests were obtained. In 
many instances, first versus second halves or odds versus evens will form 
satisfactory parallel subtests. The most certain method, however, is to 
match groups of items on statistical and other criteria available, and 
then to assign randomly each member of a group to a different parallel 
form. This matching and randomizing method gives excellent results. 
If information is available for such matching before the items are 
arranged in the test, it is possible to be certain that either the successive 
halves (thirds) of a test or the alternate items will be parallel subtests 
of representative items from the total test.

In studies comparing these three methods of obtaining test reliability, 
it is generally found that the parallel forms correlation is the lowest and 
the "corrected odd-even" reliability is the highest. 

When either of the first two methods is used, the correlation between 
two sets of scores is the reliability coefficient of the test. When the
split-half method is used, it is necessary to substitute the obtained
correlation ($r_{12}$) in the formula

(1)  $r_{xx} = \dfrac{2 r_{12}}{1 + r_{12}}$

in order to obtain the test reliability; or to obtain the variance of the
difference scores ($x_1 - x_2$), written $s_d^2$, and use the formula

(2)  $r_{xx} = 1 - \dfrac{s_d^2}{s_x^2};$

or to obtain the variance of each half and use the formula

(3)  $r_{xx} = 2\left(1 - \dfrac{s_1^2 + s_2^2}{s_x^2}\right),$

which is identical with equation 2.

It was shown that, if $s_1 = s_2$, equation 1 gives the same value as
equations 2 and 3, whereas, if $s_1 \neq s_2$, equation 1 gives a larger value
than equations 2 and 3. If three parallel subtests are used, the obtained
correlation is substituted in the formula $3r/(1 + 2r)$ to obtain the
reliability; and correspondingly, for greater numbers of subtests, equation 10,
Chapter 8, should be used with K set equal to the number of subtest
scores.
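
The three split-half computations can be compared on a small example. This is a sketch only: the function names and the half-test scores are hypothetical, and population variances (dividing by N) are assumed throughout.

```python
from statistics import pvariance


def spearman_brown(r: float, k: int = 2) -> float:
    """Stepped-up reliability for k parallel parts with intercorrelation r.
    k = 2 gives 2r / (1 + r); k = 3 gives 3r / (1 + 2r)."""
    return k * r / (1 + (k - 1) * r)


def reliability_from_differences(half_1, half_2) -> float:
    """Equation 2: one minus the ratio of difference-score variance
    to total-score variance."""
    totals = [a + b for a, b in zip(half_1, half_2)]
    diffs = [a - b for a, b in zip(half_1, half_2)]
    return 1 - pvariance(diffs) / pvariance(totals)


def reliability_from_half_variances(half_1, half_2) -> float:
    """Equation 3: uses the variance of each half and of the total."""
    totals = [a + b for a, b in zip(half_1, half_2)]
    return 2 * (1 - (pvariance(half_1) + pvariance(half_2)) / pvariance(totals))


# Hypothetical half-test scores for six subjects (illustrative only).
half_1 = [3, 5, 2, 8, 7, 6]
half_2 = [4, 4, 3, 7, 9, 5]

rel_diff = reliability_from_differences(half_1, half_2)
rel_half = reliability_from_half_variances(half_1, half_2)
print(round(rel_diff, 4), round(rel_half, 4))
```

The two variance-based functions return the same value for any data, which is the sense in which equation 3 is identical with equation 2.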

The last section presents some special considerations related to the 
reliability of essay examinations. In addition to the usual sources of 
error in objective examinations, inaccuracy of reading contributes to 
lowering the reliability of essay examinations. The correlation between 
scores assigned to the same set of papers by two different readers is 
known as reader reliability. The agreement in means and variances, as 
well as in correlations, can also be assessed by the methods of Chapter 14,
provided the examinations are read by three or more readers.

The correlation between two parallel forms of an essay examination
when corrected for the attenuation due to inaccuracy of reading was
called the content reliability ($r_{x'x''}$) of the essay examination. It is
given by

(24)  $r_{x'x''} = \dfrac{r_{x'_a x''_b}}{\sqrt{r_{x'_1 x'_2} r_{x''_1 x''_2}}},$

where $r_{x'_a x''_b}$ is the correlation between assigned scores on the parallel
forms,
$r_{x'_1 x'_2}$ is the reader reliability for the x' scores, and
$r_{x''_1 x''_2}$ is the reader reliability for the x'' scores.

Problems


1. If the correlation between the first and second halves of a test is .70, what is the 
reliability of the test? 


2. If the odd-even correlation for a test is .83, what is the reliability of the test? 


3. If the reliability of a test is .92, what should be the correlation between two 
parallel halves? 


4. If the reliability of a test is .97, what should be the correlation: 


(a) Between two parallel thirds? 
(b) Between two parallel fourths? 


5. If the standard deviation of the total scores is 45.3 and the standard deviation
of the "odds minus evens" score is 12.1, what is the test reliability on the assumption 
that odds and evens are parallel subtests? 


6. If x and y are parallel tests, the variance of x — y is 73.28, and the variance of 
x + y is 841.56.


(a) What is the reliability of the x + y score? 
(b) What is the reliability of the x score? 
(c) What is the reliability of the y score? 


7. Consider each of the following tests that are to be administered to a group of 
high school seniors. For each test, three methods of estimating reliability are being 
considered: (a) the parallel form method, (b) the odd-even method, and (c) the first- 
second halves method. For each test indicate whether each of the methods is suit- 
able or not, and explain why. 

Test A. Two-digit additions, 100 items, 2-minute time limit.

Test B. Current events information questions, 50 items, 25-minute time limit.
Each item answered correctly by 65 to 75 per cent of students. They are not ar- 
ranged in order of difficulty, but in a random order. 

Test C. A series of 30 mathematical reasoning problems, ranging in difficulty from
some that are answered correctly by 90 per cent of the students to others answered
correctly by 20 per cent. The problems are arranged in order of difficulty; one hour
is allowed for the test.

Test D. A test of differential brightness acuity. Two lights are flashed
simultaneously, and the task is to indicate which is brighter. The items have a large
difficulty range and are presented more or less in order of increasing difficulty. The
test contains 50 items presented at the rate of one every 5 seconds.

Test E. A shorthand dictation test, a 1600-word passage read at the rate of 80
words per minute.

8. An examination has an objective section (o) and an essay section (e). The 
split-half correlation of scores for section o is .80. The corresponding correlation for 
section e is .60. On section e the correlation between the total score given by reader 
A and reader B is .78. Estimate the content reliability: 


(a) For section o?
(b) For section e? 
9. The following data on 52 students taking the composition section of the French
104-5-6 in June 1940 at the University of Chicago were made available by Dr.
Lawrence Andrus.


25 
26 


Column I gives the code number of each student.

Column II gives the total score for each student.

Column III gives the score on the first fifty items for each student.

Column IV gives the score on the second fifty items for each student.


II 


41 
40 
73 
39 
74 


49 
35 
59 
44 
51 


55 
54 
36 
74 
48 


52 
66 
73 
59 
50 


25 
60 
60 
65 
41 
65 


II 


24 
22 
40 
20 
37 


31 
20 
33 
28 
25 


26 
31 
25 
35 
29 


28 
42 
39 
33 
26 


18 
31 
34 
34 
18 
35 


Number of items = 100 


Maximum number of score points = 100 


Mean raw score = 54.52 


Standard deviation = 13.98 


Number of students = 52 


From the foregoing data calculate: 


(a) The reliability coefficient for the total test by the split-half method using the
Spearman-Brown correction.


62 


III 


32 
24 
31 
32 
32 


30 
29 
36 
30 
29 


32 
31 
30 
29 
43 


20 
22 
28 
31 


15 


36 
24 
26 
35 
27 
41 


IV 


26 
11 
24 
30 
36 


25 
33 
31 
23 
25 


29 
37 
28 
31 
41 


19 
15 
28 


the split-half method using the 


- 
(b) The reliability coefficient of the total test from the variance of the total score,
and the variance of the difference between scores on the two halves. 

(c) The reliability coefficient of the total test from the variance of the total score, 
the variance of score on the first fifty items, and the variance of scores on the 
second fifty items. 

(d) The standard error of measurement for this test. 

(e) What are reasonable limits for the true score of a person who scores 51 on the
total test? One who scores 73 on the total test? 

(f) Estimate the reliability for a comparable test, twice as long as this one, four
times as long, seven times as long, ten times as long. (See table for Spearman-Brown
formula in Dunlap and Kurtz, 1932.)

(g) If one wished this test to have a reliability of .97, how long would it be necessary 
to make it? 

(h) Graph the results of problems f and g. 

(i) Estimate the reliability this test would have if it were applied to a group whose 
scores had a standard deviation half that of the original group. 

(j) Estimate the reliability this test would have if it were applied to a group whose 
scores had a standard deviation twice that of the original group. 


16 


Reliability Estimated 
from Item Homogeneity 


1. Introduction 

As previously indicated, the original approach to the problem of 
reliability by Spearman was based on the correlation of parallel tests. 
Kelley (1942) has pointed out that, according to this concept, the major 
function of the reliability coefficient is to evaluate the judgment of the
test constructor, to indicate whether or not two forms thought to meas- 
ure the same thing do in fact measure approximately the same thing. 
Recently there have been several other approaches designed to measure
the homogeneity of the items in a test. It should be noted that, if two
tests each have a high "homogeneity index" while the correlation between
them is low, we have a distinctly disturbing situation. The indication
would perhaps be that a homogeneous field existed but that the
test constructor did not know enough about that field to construct two 
parallel tests, clearly an unsatisfactory situation. Likewise, suppose 
the “homogeneity index” is very low, but the test constructor is able to 
set up a different form, a parallel test that correlates highly with the 
first form. Here it would seem that the situation is satisfactory. The 
field is not unitary, but the test constructor knows the field well enough 
to set up different tests and have them agree. In short, if a parallel 
form reliability is high, the situation is satisfactory; if the parallel form 
reliability is low, the situation is unsatisfactory, regardless of what hap- 
pens to the index of homogeneity. 

One approach to the problem of item homogeneity is to make a 
factor analysis of the inter-item correlations for a test. If there is only 
one common factor, the items are homogeneous. If the analysis reveals 
more than one common factor, it might be desirable to consider dividing 
the test into parts, each of which represented a single common factor.
Such a method would be extremely laborious for any very long test. 
Carroll (1945) has shown that the point biserial correlation cannot give 
a one-factor solution if the items differ in difficulty. He has suggested
that the two-by-two scatter plot for each pair of items be corrected for 
the effect of guessing and that the tetrachoric correlation coefficients 
then be used for a factor analysis. Such a suggestion will probably not 
be widely adopted until very much more rapid methods of obtaining 
correlations and factor analyses are available. 

The use of methods of analysis of variance for the solution of the 
problem of reliability has been suggested by many writers; see Johnson 
and Neyman (1936), Jackson (1939), (1940b), and (1942), Hoyt (1941), 
and Alexander (1947). Jackson and Ferguson (1941) show how the 
analysis of variance can aid in separating out and assessing the various 
sources of unreliability in a test. The consideration of such methods 
and such a detailed analysis of sources of unreliability are beyond the 
scope of an elementary discussion. 

It has also been shown that the homogeneity of a set of items may be
assessed by comparing the standard deviation of the test with the standard
deviation to be expected from items of the same difficulty that are
correlated zero, or correlated perfectly with each other. This approach
has been developed by Loevinger (1947).

Kuder and Richardson (1937) developed several methods of assessing 
the homogeneity of a set of items without the use of a parallel test. 
Further studies of these methods have been presented by Richardson 
and Kuder (1939), Dressel (1940), and Kaitz (1945a). The use of the
Lexian ratio in measuring reliability was suggested by Edgerton and
Thomson (1942).

Guttman (1945) and (1946) has presented a theory of reliability in
terms of estimation of lower bounds for reliability. His view is that the
upper bound for reliability of any test is always unity, and that frequently
a lower bound can be determined that is far enough from zero
and near enough to unity to be of use. He has presented a number of
different lower bound estimates for both quantitative and qualitative
data.

In this chapter, we shall present only two of these methods of estimating
reliability by means of an index of item homogeneity
which do not require the division of a test into parts.
Both methods require the number of test items (K) and the standard
deviation of the test ($s_x$). One method requires, in addition, the average
item variance; the other requires the test mean.

2. Reliability estimated from item difficulty and test variance 
If we have two tests that are parallel item for item, the intercorrelation
(reliability) of these tests may be written as follows. Let us use x
for one test and y for the other test. The items are designated by
subscripts from 1 to K.


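(1)  $r_{xy} = \dfrac{\sum_{i=1}^{N} (x_{1i} + x_{2i} + \cdots + x_{Ki})(y_{1i} + y_{2i} + \cdots + y_{Ki})}{\sqrt{\sum_{i=1}^{N} (x_{1i} + x_{2i} + \cdots + x_{Ki})^2 \; \sum_{i=1}^{N} (y_{1i} + y_{2i} + \cdots + y_{Ki})^2}}.$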

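Expanding and collecting terms in the numerator, and noting that
since we are dealing with parallel tests the two factors in the denominator
are equal, we have

(2)  $r_{xy} = \dfrac{\sum_{g=1}^{K} \sum_{i=1}^{N} x_{gi} y_{gi} + \sum_{g=1}^{K} \sum_{h=1}^{K} \sum_{i=1}^{N} x_{gi} y_{hi}}{\sum_{g=1}^{K} \sum_{i=1}^{N} x_{gi}^2 + \sum_{g=1}^{K} \sum_{h=1}^{K} \sum_{i=1}^{N} x_{gi} x_{hi}} \quad (g \neq h).$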


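Dividing through by N and writing the results in terms of correlation
and standard deviations, we have

(3)  $r_{xy} = \dfrac{\sum_{g=1}^{K} r_{x_g y_g} s_{x_g} s_{y_g} + \sum_{g=1}^{K} \sum_{h=1}^{K} r_{x_g y_h} s_{x_g} s_{y_h}}{\sum_{g=1}^{K} s_{x_g}^2 + \sum_{g=1}^{K} \sum_{h=1}^{K} r_{x_g x_h} s_{x_g} s_{x_h}} \quad (g \neq h).$
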

Since $s_{x_g}$ and $s_{y_g}$ are standard deviations of parallel items in two forms
of the test, we may assume that they are equal. Likewise, since $r_{x_g y_g}$ is
the correlation between two parallel items, it is a reliability coefficient,
and may be written $r_{gg}$. In general, we may now drop the distinction
between x and y and retain only the subscripts that denote whether we
have the same or a different item. This change gives

(4)  $r_{xy} = \dfrac{\sum_{g=1}^{K} r_{gg} s_g^2 + \sum_{g=1}^{K} \sum_{h=1}^{K} r_{gh} s_g s_h}{\sum_{g=1}^{K} s_g^2 + \sum_{g=1}^{K} \sum_{h=1}^{K} r_{gh} s_g s_h} \quad (g \neq h).$


It will be noted that the denominator is the variance of the total test.
Since the numerator and denominator are alike except for the first
term, we may designate the total test variance by $s_x^2$ and write

(5)  $r_{xy} = \dfrac{s_x^2 - \sum_{g=1}^{K} s_g^2 + \sum_{g=1}^{K} r_{gg} s_g^2}{s_x^2}.$


Since we do not actually have two tests that parallel each other item
for item, it is necessary to make some assumption in order to find a
value for $r_{gg}$. The simplest and most direct assumption is that the
average $r_{gg} s_g^2$, which is the covariance between parallel items, is equal
to the average ($r_{gh} s_g s_h$), which is the covariance between non-parallel
items. That is,

(6)  $\dfrac{\sum_{g=1}^{K} r_{gg} s_g^2}{K} = \dfrac{\sum_{g=1}^{K} \sum_{h=1}^{K} r_{gh} s_g s_h}{K(K - 1)} \quad (g \neq h).$
Since

(7)  $s_x^2 = \sum_{g=1}^{K} s_g^2 + \sum_{g=1}^{K} \sum_{h=1}^{K} r_{gh} s_g s_h \quad (g \neq h),$

we may write

(8)  $\sum_{g=1}^{K} r_{gg} s_g^2 = \dfrac{s_x^2 - \sum_{g=1}^{K} s_g^2}{K - 1}.$


Substituting this value in equation 5 gives

(9)  $r_{xy} = \dfrac{s_x^2 - \sum_{g=1}^{K} s_g^2 + \dfrac{s_x^2 - \sum_{g=1}^{K} s_g^2}{K - 1}}{s_x^2}.$

We may write $r_{xx}$ for the reliability of the test and simplify equation 9 to

(10)  $r_{xx} = \dfrac{K}{K - 1} \left( \dfrac{s_x^2 - \sum_{g=1}^{K} s_g^2}{s_x^2} \right),$

where $r_{xx}$ is the reliability coefficient of the test,
K is the number of items in the test,
$s_g^2$ is the variance of item g (equals $p_g(1 - p_g)$, where p is the
percentage getting the item correct), and
$s_x^2$ is the test variance.

It should again be noted that the only assumption made in deriving
this equation was that the average covariance among non-parallel items 
was equal to the average covariance among parallel items. 

In terms of item difficulties or percentage passing a given item, we
may write

(11)  $r_{xx} = \dfrac{K}{K - 1} \left[ \dfrac{s_x^2 - \sum_{g=1}^{K} (p_g - p_g^2)}{s_x^2} \right],$

where $p_g$ is the proportion passing a given item, and all other terms are
defined as in the preceding equation.


If the test variance, the number of items in a test, and the per- 
centage of persons correctly answering each item are known, 
and if the test score is the number of items answered cor- 
rectly, a lower bound for the reliability coefficient of the test 
is given by equation 10 or equation 11. These equations are 
based on only one assumption, that the average covariance 
between non-parallel items is equal to the average covariance 
between parallel items. 
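
Equation 11 is easy to compute once the item proportions and the test variance are in hand. The sketch below assumes the proportions are given as fractions between 0 and 1; the function name `reliability_eq11` and the sample values are illustrative, not taken from the text.

```python
def reliability_eq11(proportions, test_variance: float) -> float:
    """Equation 11: K/(K-1) times (test variance minus the sum of the
    item variances p(1-p)), divided by the test variance."""
    k = len(proportions)
    if k < 2:
        raise ValueError("at least two items are required")
    if test_variance <= 0:
        raise ValueError("test variance must be positive")
    sum_item_variances = sum(p * (1 - p) for p in proportions)
    return (k / (k - 1)) * (test_variance - sum_item_variances) / test_variance


# Hypothetical four-item test: every item passed by half the group,
# test variance 2.0 (illustrative numbers only).
estimate = reliability_eq11([0.5, 0.5, 0.5, 0.5], 2.0)
print(round(estimate, 4))
```

If the item data are recorded as percentages, as in the problems at the end of the chapter, they would need to be divided by 100 before being passed to a function like this one.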


It should be noted that formulas 10 and 11 are identical with “for- 
mula 20,” derived by Kuder and Richardson (1937), with formula 29, 
in Chapter V of Jackson and Ferguson (1941), and the formula for L3 
given by Guttman (1945). However, the assumptions used for the 
derivation were radically different in these three papers. Kuder and 
Richardson assumed that all inter-item correlations were equal. Jack- 
son and Ferguson, however, showed that it is necessary only to assume 
that the average covariance between parallel items is equal to the aver- 
age covariance between non-parallel items. They also showed that the
assumptions made by Kuder and Richardson (1937) were not only 
unnecessarily restrictive, but were in some cases internally inconsistent. 
Guttman demonstrated that the value given by equation 10 is a lower 
bound to the reliability coefficient. 

If the item-test correlations or the inter-item correlations are known, 
it is possible to use this information in more complex formulas to obtain 
better estimates of the test reliability. Such formulas have been given 
by Kuder and Richardson (1937) and by Guttman (1945). These
formulas are not given here since it seems that formula 10 is usually 
quite satisfactory. As a result of some empirical studies, Richardson 
and Kuder (1939) recommended their "formula 20" as the best one
to use. 
3. Reliability estimated from test mean and variance

Let us consider the simplified Kuder-Richardson formulation that is
obtained by assuming that all items are of the same difficulty. In this
case it is possible to estimate $\sum s_g^2$ or $\sum pq$ from the mean of the test.
The number of subjects getting item g correct is $N p_g$. The sum of these
terms over all the items is the total number of correct answers, $N \sum_{g=1}^{K} p_g$.
Since the total number of correct answers divided by the number of
subjects is the mean of the test, we have

(12)  $M_x = \sum_{g=1}^{K} p_g,$

or, using $\bar{p}$ to designate the average item difficulty, we may write

(13)  $M_x = K\bar{p}.$

Likewise the sum of the variances ($\sum s_g^2$) may be written

(14)  $\sum p_g - \sum p_g^2 = K\bar{p} - K\overline{p^2}.$

If all items are the same difficulty, the average of the squares will be
equal to the square of the average, and we may write

(15)  $\sum p_g - \sum p_g^2 = M_x - \dfrac{M_x^2}{K}.$

If we substitute this equation in the numerator of equation 11, we have

(16)  $r_{xx} = \dfrac{K}{K - 1} \left( \dfrac{s_x^2 - M_x + \dfrac{M_x^2}{K}}{s_x^2} \right),$

where $r_{xx}$ is the reliability of the test,
K is the number of items in the test,
$M_x$ is the test mean, and
$s_x^2$ is the variance of raw scores on the test.

This formula is identical with the Kuder-Richardson "formula 21."
The derivation given here uses the same assumption as equation 10 or
equation 11 plus the assumption that all item difficulties are equal.
Formula 16 has the advantage of being very simple to calculate, since
it uses only the mean, variance, and number of items. Also it has the
advantage of being a lower bound so that we can by the use of this
formula quickly satisfy ourselves that a given test is performing fairly
well. This formula gives an exact figure for the reliability if all items
are of the same difficulty level. If the items in an examination have a
wide difficulty range, formula 16 gives an unsatisfactorily low figure
for the reliability.


If only the mean, standard deviation, and number of items
in a test are known, and if the test score is the number of
items answered correctly, a lower bound for the reliability
coefficient of the test is given by formula 16. If all the test items
are of equal difficulty, this value will be identical with that
given by formulas 10 and 11; otherwise it will be smaller.
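
The relation between formula 16 and formula 11 stated above can be illustrated with a short sketch. The function names and all numbers are hypothetical; proportions are taken as fractions between 0 and 1, and the test mean is taken to be the sum of the item proportions, as in equation 12.

```python
def reliability_eq16(k: int, mean: float, variance: float) -> float:
    """Equation 16: uses only the number of items, test mean, and test variance."""
    return (k / (k - 1)) * (variance - mean + mean ** 2 / k) / variance


def reliability_eq11(proportions, variance: float) -> float:
    """Equation 11, repeated here so the comparison is self-contained."""
    k = len(proportions)
    sum_item_variances = sum(p * (1 - p) for p in proportions)
    return (k / (k - 1)) * (variance - sum_item_variances) / variance


variance = 2.0  # hypothetical test variance

# Equal difficulties: both formulas agree.
equal_items = [0.5, 0.5, 0.5, 0.5]
same_16 = reliability_eq16(len(equal_items), sum(equal_items), variance)
same_11 = reliability_eq11(equal_items, variance)

# Unequal difficulties with the same mean: formula 16 is the smaller.
mixed_items = [0.9, 0.5, 0.5, 0.1]
mixed_16 = reliability_eq16(len(mixed_items), sum(mixed_items), variance)
mixed_11 = reliability_eq11(mixed_items, variance)

print(round(same_16, 4), round(same_11, 4))
print(round(mixed_16, 4), round(mixed_11, 4))
```

The gap in the second case arises because, for a fixed mean, the sum of p(1 - p) is largest when all items have the same difficulty, so formula 16 subtracts at least as much from the test variance as formula 11 does.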


4. Summary 


If the score on a test is the number of items correctly answered, and if
we know the number of items in the test, the test variance, and the
percentage of persons answering each item correctly, the test reliability
may be calculated by

(10)  $r_{xx} = \dfrac{K}{K - 1} \left( \dfrac{s_x^2 - \sum_{g=1}^{K} s_g^2}{s_x^2} \right),$

where $r_{xx}$ is the reliability coefficient of the test,
K is the number of items in the test,
$s_x^2$ is the test variance, and
$s_g^2$ is the variance of item g, which equals $p_g(1 - p_g)$, where p
is the percentage of persons answering the item correctly.

Substituting for $s_g^2$ its value in terms of $p_g$, we have

(11)  $r_{xx} = \dfrac{K}{K - 1} \left[ \dfrac{s_x^2 - \sum_{g=1}^{K} (p_g - p_g^2)}{s_x^2} \right].$


These formulas are based on the assumption that the average covariance 
between non-parallel items is equal to the average covariance between 
parallel items. Since, in general, the former is smaller than the latter, 
the values given by equations 10 and 11 will, in general, be underesti- 
mates of the reliability. 


Using only the test mean, variance, and number of items, we may
estimate the test reliability by

(16)  $r_{xx} = \dfrac{K}{K - 1} \left( \dfrac{s_x^2 - M_x + \dfrac{M_x^2}{K}}{s_x^2} \right),$

where $M_x$ is the test mean and the other terms have the same definitions
as in equation 10. If all the test items are of equal difficulty, equation 16
will be identical with equations 10 and 11. Usually equation 16 gives
values that are considerably less than the values given by equations 10
and 11. Like equations 10 and 11, equation 16 may be used only when
the score on a test is a linear function of the number of items answered
correctly.

Problems 


1. We have the following information on a test. Find the reliability by using 
formulas 11 and 16. 


Item p 

1 70 

2 90 

3 8s 

4 94 

5 77 

6 86 

7 69 

8 85 

9 46 

10 ae 

11 74 

12 60 

13 30 M = 15.5 
14 50 

15 85 s= 5.6 
16 90 

17 35 

18 25 

19 47 
20 91 
21 27 
22 23 
23 34 

24 32 

25 65 

15.50 


p is the percentage of persons answering each item correctly. (Ample time was
allowed for this test, and all 500 persons attempted each item.) The score was
the number of items answered correctly.
2. Use the following information to obtain the test reliability by formulas 11 and 16.


Item p Na 
1 73 500 
2 68 500 
3 90 500 
4 91 500 
5 70 500 
6 80 500 
7 77 500 
8 39 500 
9 61 500 
10 72 500 
11 71 500 
12 66 500 
13 37 500 
14 50 500 
15 85 500 $M_x$ = 15.3
16 49 500 $s_x$ = 5.1
17 70 496 
18 65 495 


19 57 490
20 54 488 


21 16 475 
22 15 465 
23 15 455 
24 20 450 
25 23 440 


26 35 420 
27 34 410 
28 30 400 
29 28 400 
30 33 390 


Use the item analysis data to determine the reliability of the test. (Note that p is 
not percentage of total group answering item correctly.) 

Score is number of items answered correctly.

Na is number of persons attempting each item. 

p is percentage of persons attempting the item who answer it correctly. 
3. The following data on 52 students taking the composition section of the French
104-5-6 in June 1940 were made available by Dr. Lawrence Andrus of the University
of Chicago:


Column A gives the item number. 


Column B gives the proportion of entire group passing. 


25 


B 


.13 
.94 
.42 
.96 
NE 


.27 


.65 
.00 


.63 


.42 


N =52. M = 54.52. s= 13.98. 


A


B


.04
.69


From the foregoing data:

(a) Estimate the reliability of the test from the test variance and average item
variance.

(b) Estimate the reliability from the test mean and variance.

(c) Compare these values with those found for the same set of test papers in
problem 9, Chapter 15.

Speed versus Power Tests


1. Definition of speed and power tests 

In this chapter the problem of distinguishing between speed and power 
tests will be considered, and a criterion will be proposed for determining 
the extent to which a given test approaches a “pure speed” or a “pure 
power” test. This material is presented as a suggestion toward a differ- 
ential rationale for speed and power tests. Relatively little has been 
written on this subject, despite the fact that the problems of item 
analysis, test length, item difficulty distribution, determination of 
reliability, and error of measurement are all quite different for the two 
types of tests. At present most tests are a composite in unknown pro- 
portions of speed and power, which makes the development of appropriate 
theorems in test theory more difficult than for the pure type tests. 

First let us define what is meant by a pure speed and a pure power 
test. A pure speed test is a test composed of items so easy that the 
subjects never give the wrong answer to any of them. The answers are 
correct as far as the subject has gone in the test. However, the test 
contains so many items that no one finishes it in the time allowed. The 
subject’s score, therefore, depends entirely on how far he is able to go 
in the time allowed. (We shall assume here that the subjects are in- 
structed not to skip any of the items, and that they follow that in- 
struction.) 

In order to discuss the speed-power problem symbolically we shall 
distinguish between two types of “errors.” We let 


W designate the number of items for which the subject gives an 
incorrect answer, 

U designate the number of items that the subject does not reach, and 

X designate the total error score on the test. 


That is, X = W + U. 


In a “pure speed” test W will be zero for each subject; hence both
the mean and the standard deviation of W will equal zero. Also X = U,
that is, the subject’s entire score is determined by the number of items
that he does not attempt; hence the mean of X equals the mean of U,
and the variance of X equals the variance of U.

These are the characteristics of a pure speed test. Any actual test
then may be said to approach being a pure speed test to the extent that
$M_W$, the mean, and $s_W$, the standard deviation of the W's, approach
zero, and the mean and the standard deviation of the U's approach the
mean and standard deviation of the total number of errors (W + U).

In a “pure power” test all the items are attempted so that the score 
on the test depends entirely upon the number of items that are answered, 
and answered incorrectly. (Again we assume that by careful directions 
none of the items is skipped.) In the pure power test, U will be zero 
for each person; hence the mean and standard deviation of U will be 
zero. Since for each subject X = W, the mean and standard deviation 
of X equal the mean and standard deviation of W. Again we should 
note that these characteristics hold strictly only for the pure power test. 
To the extent that these conditions are approximated, the test
approaches a power test.

As has already been pointed out, the split-half (especially the odd-even)
reliability cannot be used for any test except a pure power test.
The more the speed factor enters into the determination of
test score, the higher the odd-even reliability will become. Let us now
consider a criterion that will indicate when a test is sufficiently close to
a pure power test so that we may be relatively certain that the odd-even
reliability or some other split-half reliability will not be spuriously high
or low. Likewise a criterion for a pure speed test should indicate when
a test is primarily a speed test so that the variability due to item
difficulty or to carelessness in answering items is negligible. Depending on
whether speed and power are positively or negatively correlated, the
test-retest reliability of a test that involves both elements is likely to
be higher or lower than the reliability of a test that involves only one
element. Therefore, if we wish to measure speed in a given function it
is important to make certain that we are dealing only to a negligible
extent with a test involving power.



2. Effect of unattempted items (or wrong items) on the standard deviation

First let us consider the problem of determining whether the standard
deviation of a test is influenced mainly by the speed or the power factor
in the test. As in previous derivations, $M_X = M_W + M_U$ so that we
may designate the deviation scores by lower-case letters, and write

$x = w + u.$

Taking the standard deviation, we square, sum, and divide by N,
obtaining

(1)  $\dfrac{\sum x^2}{N} = \dfrac{\sum (w + u)^2}{N}.$

Expanding, we have

(2)  $\dfrac{\sum x^2}{N} = \dfrac{\sum w^2}{N} + \dfrac{\sum u^2}{N} + \dfrac{2 \sum wu}{N}.$

This may also be written

(3)  $s_x^2 = s_w^2 + s_u^2 + 2 r_{wu} s_w s_u.$


In a pure power test, all subjects will finish, so the variance of u will be
zero; hence the last two terms will be zero. In a pure speed test there
will be no errors made by one who attempts an item; hence the first
and the last terms of the right-hand expression will be zero.

In a pure speed test, $s_w = 0$ and $s_u = s_x$. In a pure power test,
$s_u = 0$ and $s_w = s_x$.
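
Equation 3 can be verified on any set of scores. The sketch below uses made-up W and U scores for six subjects and assumes population standard deviations (dividing by N); the helper `pearson_r` is written out rather than taken from a library.

```python
from statistics import mean, pstdev


def pearson_r(a, b) -> float:
    """Product-moment correlation of two equal-length score lists."""
    mean_a, mean_b = mean(a), mean(b)
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)
    return covariance / (pstdev(a) * pstdev(b))


# Hypothetical scores (illustrative only).
wrong = [2, 0, 3, 1, 4, 2]       # W: items answered incorrectly
unreached = [5, 9, 3, 6, 1, 4]   # U: items not reached
errors = [w + u for w, u in zip(wrong, unreached)]  # X = W + U

s_w, s_u, s_x = pstdev(wrong), pstdev(unreached), pstdev(errors)
r_wu = pearson_r(wrong, unreached)

left_side = s_x ** 2
right_side = s_w ** 2 + s_u ** 2 + 2 * r_wu * s_w * s_u
print(round(left_side, 6), round(right_side, 6), round(r_wu, 3))
```

In these made-up data the subjects who leave the fewest items unreached tend to make the most errors, so the correlation between W and U comes out negative and the last term reduces the variance of X.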

It should be noted that $r_{wu}$ may well be negative. The subject who
omits the fewest items will have answered the most items. Therefore,
he may well have a great many errors, thus tending to make the subject
with many actual errors (w) the one with the fewest unattempted items
(u). For this reason, if we do not wish to calculate both $s_u$ and $s_w$, it is
necessary to rely on the one likely to be zero, or near zero.

For example, if $r_{wu}$ is $-1$, $s_x = s_w - s_u$ or else $s_x = s_u - s_w$. In
either case it is possible that both $s_w$ and $s_u$ would be larger than $s_x$,
thus making the use of either one alone unsuitable as an indication of
the magnitude of the other variance.

On the other hand, if either $s_w$ or $s_u$ is zero, or very nearly zero, the
other component must be very nearly equal to $s_x$. The two extreme cases
occur when $r_{wu} = +1$ or $-1$. In the former case $s_x = s_w + s_u$; in the
latter $|s_x| = |s_w - s_u|$. If $s_u/s_x = 0.1$, $s_w/s_x$ must lie between 0.9
and 1.1. If $s_u/s_x = 0.01$, the ratio $s_w/s_x$ cannot be less than 0.99 nor
more than 1.01. In such a case we have a test that is primarily a power
test, in the sense that the test variance would not be changed much if
the subjects were allowed to finish the test. At one extreme possibility,
if they were allowed to finish, they would all get all the unfinished items
wrong, in which event the new $s_x$ would equal the old one. At the other
extreme no one would get any of the items wrong, in which case the new
$s_x$ would be equal to the present $s_w$, which, as we have seen above, must
be within 10 per cent of $s_x$ if the ratio $s_u/s_x = 0.1$.

Thus from the viewpoint of effect upon the standard deviation of a
test, we may say that a test is essentially a speed test if $s_w/s_x$ is very
small; and that a test is essentially a power test if $s_u/s_x$ is very small.
For a speed test $s_w/s_x$ is small and

(4)    $1 + \frac{s_w}{s_x} > \frac{s_u}{s_x} > 1 - \frac{s_w}{s_x}.$

A lower bound for the standard deviation is indicated by

(5)    $s_u' = s_x - s_w;$

an upper bound indicated by

(6)    $s_u'' = s_x + s_w.$

For a power test $s_u/s_x$ is small and

(7)    $1 + \frac{s_u}{s_x} > \frac{s_w}{s_x} > 1 - \frac{s_u}{s_x}.$

From which we have a lower bound for the standard deviation
indicated by

(8)    $s_w' = s_x - s_u;$

and an upper bound indicated by

(9)    $s_w'' = s_x + s_u.$

It should be noted that, although statements identical to the foregoing 
ones can be made for a large ratio, they are in that case not very helpful. 


For example, if $s_u/s_x = 0.75$, then

$1 + 0.75 > \frac{s_w}{s_x} > 1 - 0.75.$

In other words, $s_w/s_x$ may be as small as 0.25, which is one-third of the
ratio $s_u/s_x$, or it may be equal to 1.75, which is more than double the
ratio $s_u/s_x$.
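
The bounds of equations 4 and 7 are easily tabulated. The sketch below is only an illustration; the function name is an assumption, and the ratios are those mentioned in the text. It shows how quickly the interval widens as the smaller ratio grows.

```python
def ratio_bounds(known_ratio):
    """Limits for the other component's ratio to s_x (form of equations 4 and 7).

    known_ratio is s_u/s_x when bounding s_w/s_x, or s_w/s_x when bounding s_u/s_x.
    """
    if known_ratio < 0:
        raise ValueError("a ratio of standard deviations cannot be negative")
    return 1 - known_ratio, 1 + known_ratio

for ratio in (0.01, 0.1, 0.75):
    lower, upper = ratio_bounds(ratio)
    print(f"ratio {ratio}: other ratio lies between {lower:.2f} and {upper:.2f}")
```

For a ratio of 0.1 the other component is fixed within 10 per cent of $s_x$; for 0.75 the interval is too wide to be of much use.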


3. Effect of unattempted items on the error of measurement 

The error of measurement for the total score x is equal to the standard
deviation times the quantity $\sqrt{1 - r}$. Since we have already con-
sidered the standard deviation let us consider the other quantity,
$\sqrt{1 - r}$. Again we define the reliability as the correlation between
two parallel forms, designated 1 and 2:

(10)    $r_{x_1x_2} = \frac{\sum x_1x_2}{N s_{x_1} s_{x_2}}.$
Let us first write out the numerator in terms of the component scores
w and u:

(11)    $\sum x_1x_2 = \sum (w_1 + u_1)(w_2 + u_2).$

Expanding, we have

(12)    $\sum x_1x_2 = \sum w_1w_2 + \sum u_1u_2 + \sum w_2u_1 + \sum w_1u_2.$


Using reliabilities and intercorrelations, and noting that variances of
parallel forms are equal, we have

(13)    $\sum x_1x_2 = N r_{ww} s_w^2 + N r_{uu} s_u^2 + 2N r_{wu} s_w s_u.$

Substituting equation 13 in equation 10, and setting the two variances
in the denominator equal to each other, we have

(14)    $r_{x_1x_2} = \frac{r_{ww} s_w^2 + r_{uu} s_u^2 + 2 r_{wu} s_w s_u}{s_x^2}.$
Substituting equation 3 in equation 14, we have

(15)    $r_{x_1x_2} = \frac{r_{ww} s_w^2 + r_{uu} s_u^2 + 2 r_{wu} s_w s_u}{s_w^2 + s_u^2 + 2 r_{wu} s_w s_u}.$

Using equation 15, we may write

(16)    $1 - r_{x_1x_2} = \frac{s_w^2(1 - r_{ww}) + s_u^2(1 - r_{uu})}{s_w^2 + s_u^2 + 2 r_{wu} s_w s_u}.$

From equation 3 we see that the denominator of equation 16 is $s_x^2$.
Making this substitution gives

(17)    $s_x^2(1 - r_{xx}) = s_w^2(1 - r_{ww}) + s_u^2(1 - r_{uu}),$

where $s_w^2$ is the variance of the w-score (number of items answered
          incorrectly),
      $s_u^2$ is the variance of the u-score (number of items unattempted
          at the end of the test),
      $s_x^2$ is the variance of x, which equals w + u,
      $r_{ww}$, $r_{uu}$, and $r_{xx}$ represent the reliabilities of these scores. The
          formula is correct for either split-half or alternate form
          estimates of reliability.

If x is defined as w + u, the error variance for the x-score
is equal to the error variance of the w-score plus the error
variance of the u-score.
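
Equations 15 and 17 can be verified with a short computation. The following sketch is illustrative only: the function name and the numerical values of the standard deviations, reliabilities, and intercorrelation are assumptions, not data from any test.

```python
def total_reliability(s_w, s_u, r_ww, r_uu, r_wu):
    """Reliability of x = w + u from its components (equation 15).

    Returns (r_xx, s_x squared); s_x squared comes from equation 3.
    """
    cross = 2 * r_wu * s_w * s_u
    s_x_squared = s_w**2 + s_u**2 + cross
    r_xx = (r_ww * s_w**2 + r_uu * s_u**2 + cross) / s_x_squared
    return r_xx, s_x_squared

# Hypothetical component values.
s_w, s_u, r_ww, r_uu, r_wu = 6.0, 2.0, 0.90, 0.70, -0.2
r_xx, s_x_squared = total_reliability(s_w, s_u, r_ww, r_uu, r_wu)

# Equation 17: the error variance of x is the sum of the component error variances.
error_variance_total = s_x_squared * (1 - r_xx)
error_variance_parts = s_w**2 * (1 - r_ww) + s_u**2 * (1 - r_uu)
print(r_xx, error_variance_total, error_variance_parts)
```

The two error variances agree whatever value is taken for $r_{wu}$, since the cross-product term cancels in passing from equation 15 to equation 16.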
It should be noted that in any test that has both the w and u com-
ponents, the split-half reliability of u is unity; hence the last term of
equation 17 is zero for any split-half reliability. A valid estimate of
this second term is given only by a test-retest reliability. For a pure
power test, the variance of u is zero; hence a stepped-up split-half
correlation is a valid estimate of its reliability. If a test is primarily a
power test, that is, if the variance of u is negligible, the stepped-up
split-half correlation is still a reasonable estimate of the test reliability.
However, when a test is partly speed as well as power so that the second
term of equation 17 is not negligible, or when a test is primarily a speed
test so that this second term is the major component of the error of
measurement, the error of measurement obtained from a split-half
reliability is too low. In such a case a test-retest or a parallel form
reliability must be used. Whenever the standard deviation of u is
much greater than two or three tenths of the standard deviation of w,
a split-half correlation is an unsafe basis for estimating the test reliability.

If a test is primarily a power test, it is possible to use the split-half
reliability to estimate a range for the error of measurement. Setting $r_{uu}$
equal to zero in equation 17 will give an upper bound for the error of
measurement; setting it equal to unity (as is done in the split-half
reliability coefficient) will give a lower bound for the error of measure-
ment. For any split-half reliability in which the untried items are
divided equally between the halves it is necessarily true that

(18)    $s_x^2(1 - r_{xx}) = s_w^2(1 - r_{ww}).$

Since the error of measurement would be larger if the subjects had been
allowed to finish the test, but could not increase by more than the
value of $s_u^2(1 - r_{uu})$ when $r_{uu} = 0$, we may use $s^2_{meas.}$ to represent the
error variance of the test and write

(19)    $s_x^2(1 - r_{xx}) + s_u^2 > s^2_{meas.} > s_x^2(1 - r_{xx}),$

where the terms have the same definition as in equation 17. However,
equation 19 applies only in the case of a split-half reliability estimate
for a test that is primarily a power test so that $s_u^2$ will be a relatively
small possible addition to the error of measurement.
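
The range given by equation 19 may be computed as follows. This is a minimal sketch under the stated condition (a split-half reliability for a primarily power test); the function name and the sample values are assumptions for illustration.

```python
def error_variance_bounds(s_x, r_split_half, s_u):
    """Lower and upper limits for the error variance (equation 19)."""
    lower = s_x**2 * (1 - r_split_half)  # r_uu taken as unity
    upper = lower + s_u**2               # r_uu taken as zero
    return lower, upper

# Hypothetical test: s_x = 10, stepped-up split-half r = .90, s_u = 1.
lower, upper = error_variance_bounds(s_x=10.0, r_split_half=0.90, s_u=1.0)
print(lower**0.5, upper**0.5)  # limits for the error of measurement itself
```

Taking square roots converts the limits on the error variance into limits on the error of measurement.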

If a test is primarily a speed test, a test-retest or an alternate form 
reliability must be used. The error of measurement calculated from this 
reliability will have the two components indicated by equation 17. For 
a pure speed test, the first of these components would be zero because 
$s_w$ is zero for a speed test. Regardless of the magnitude of $s_w$, the error
of measurement calculated from a test-retest or an alternate form
reliability correctly represents the functioning error of measurement of
the test. If the directions for the test and the attitude of the subjects
were changed so that no errors (w-score) were made, the new error of 
measurement would be different from the old one. It does not seem 
feasible at present to try to estimate the possible magnitude of this 
change. 


4. Effect of unattempted items on the reliability 
Equation 18 of Chapter 4 may be rewritten

(20)    $s^2_{meas.} = s_x^2(1 - r_{xx}).$

Solving equation 20 explicitly for the reliability coefficient, we have

(21)    $r_{xx} = 1 - \frac{s^2_{meas.}}{s_x^2}.$

If we use equation 21 and substitute various values of the error of 
measurement and the standard deviation as indicated in the two pre- 
ceding sections, we shall obtain some possible upper and lower bounds 
for the reliability coefficient of power tests that are partially speeded 
and speed tests that are in part power tests.

For a test that is primarily a power test, a possible estimate of a lower
bound for the reliability coefficient may be found by using a small
estimate of the standard deviation as given in equation 8 and a large
value for the error of measurement as indicated in the first expression
of equation 19. Substituting these two values in equation 21 gives

(22)    $\text{lower bound} = 1 - \frac{s_x^2(1 - r_{xx}) + s_u^2}{(s_x - s_u)^2}.$

Dividing through by $s_x^2$, setting H for $s_u/s_x$, and writing the expression
with a common denominator gives

(23)    $\text{lower bound} = \frac{1 - 2H + H^2 - 1 + r_{xx} - H^2}{1 - 2H + H^2}.$

Simplifying and ignoring the term $H^2$, we have

(24)    $\text{lower bound} = \frac{r_{xx} - 2H}{1 - 2H}.$


Using a stepped-up split-half correlation for the reliability of a
partially speeded power test will certainly give a figure higher than the
actual reliability of the test so that the obtained reliability coefficient
that has been designated by $r_{xx}$ may be regarded as an upper bound for
the reliability coefficient. We may use equation 24 as a lower bound
and designate the correct reliability by R, obtaining

(25)    $r_{xx} > R > \frac{r_{xx} - 2H}{1 - 2H},$

where $r_{xx}$ is the stepped-up split-half correlation,
      H is $s_u/s_x$, the ratio of the standard deviation of the “number
          unattempted score” to the standard deviation of the
          number not answered correctly (u + w), and
      R is the reliability of the test.


It should be noted that for many tests the right-hand term of equation
25 will give a lower bound that is distressingly low, and may well be far 
lower than an alternate form reliability for the test. However, it seems 
probable that, if this lower bound turns out to be satisfactorily high, 
there can be little doubt that the reliability of the test will be satis- 
factory. Beginning with equation 15, there are various other assump- 
tions that may be made regarding what might happen if a parallel form 
instead of a split-half reliability had been used. An experimental 
investigation of the typical behavior of the various terms in equation 15 
is probably needed in order to determine which assumptions are most 
appropriate. 

Another possible lower bound for the reliability of a somewhat speeded
power test can be illustrated with equation 15. We may assume that
for a split-half reliability all the terms on the right-hand side of equation
15 are correct except for $r_{uu}s_u^2$. In a split-half correlation, this term is
clearly too large, since $r_{uu}$ is necessarily unity. If the term $s_u^2$ is sub-
tracted from the numerator of equation 15 this will have the effect of
assuming that $r_{uu}$ is zero, and may well give a good lower bound for the
reliability of the test. Let us refer to equation 14. The numerator may
be expressed as $r_{xx}s_x^2$. If we subtract $s_u^2$ from this, and divide by the
variance of x, we shall have a reliability figure under the assumption
that $r_{uu}$ is zero instead of unity. Thus we have

(26)    $\text{lower bound} = \frac{r_{xx}s_x^2 - s_u^2}{s_x^2}.$

Writing H for $s_u/s_x$, we have

(27)    $\text{lower bound} = r_{xx} - H^2,$

where the terms have the same definition as in equation 25. For this
new lower bound we may write

(28)    $r_{xx} > R' > r_{xx} - H^2.$
If a power test is partially speeded, and a split-half reli-
ability ($r_{xx}$) has been calculated, equation 25 or 28 may be
used to give some idea of the extent to which $r_{xx}$ is an over-
estimate of the test reliability.


Some evidence has been presented (see Gulliksen, 1950) indicating
that equation 28 may be satisfactory provided $H^2$ is less than 0.2.
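
The two lower bounds may be computed together once H is known. The following is a sketch for illustration only; the function name and the sample values are assumptions, and the restriction on H merely keeps the denominator of equation 25 positive.

```python
def reliability_lower_bounds(r_split_half, s_u, s_x):
    """Return H and the lower bounds of equations 25 and 28."""
    h = s_u / s_x
    if not 0 <= h < 0.5:
        raise ValueError("equation 25 requires 1 - 2H to be positive")
    bound_25 = (r_split_half - 2 * h) / (1 - 2 * h)
    bound_28 = r_split_half - h**2
    return h, bound_25, bound_28

# Hypothetical test: stepped-up split-half r = .95, s_u = 1, s_x = 10.
h, bound_25, bound_28 = reliability_lower_bounds(0.95, s_u=1.0, s_x=10.0)
print(h, bound_25, bound_28)
```

With H as small as 0.1 the two bounds nearly agree; as H increases, the bound of equation 25 falls much faster than that of equation 28.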

If a test is primarily a speeded test, an alternate form or a test-retest
reliability must be used. Such a reliability correctly represents the 
functioning reliability of the test. If only the number of items un- 
attempted is used as the score, we have a relatively pure measure of 
speed; if the number correct is used, both speed and accuracy enter 
into the score. By using both these scores, we can determine the relative 
reliability of speed alone and of speed together with accuracy. Since 
this problem is purely experimental, no further theoretical discussion 
will be given here. 


5. Estimation of the variance of the number-unattempted score from item analysis data


The preceding discussion has been in terms of the number of unat-
tempted items because it is possible to obtain the variance of this score
from item analysis data which gives number answered correctly, number
answered incorrectly, and number of persons not reaching the item.
Thus, if item analysis data are available, the variance of the “number-
unattempted score,” hence the ratio H, can be calculated without
rescoring of the papers.

Let us use K to designate the number of items in the test and $y_g$ to
designate the number of persons who did not reach item g. It is clear
then that $y_g \le y_{g+1}$, since all persons who did not reach item g did not
reach any subsequent item. We shall also assume that $y_g = 0$ for
the items near the beginning of the test. It is clear that

(29)    $\sum_{i=1}^{N} U_i = \sum_{g=1}^{K} y_g.$

That is, the sum of all the unattempted scores may be obtained by sum-
ming over persons or over items. Therefore,

(30)    $M_U = \frac{\sum_{i=1}^{N} U_i}{N} = \frac{\sum_{g=1}^{K} y_g}{N}.$
That is, the average unattempted score is equal to the sum of the
number unattempted from item 1 to item K, divided by the number of
persons taking the test.

In order to obtain the standard deviation of the number-unattempted
score, we shall use the usual formula for standard deviation written as
follows:

(31)    $s_u^2 = \frac{\sum_{i=1}^{N} U_i^2}{N} - M_U^2.$


Since $M_U$ is given by formula 30, we know all the terms in this equation
except $\sum U^2$. Let us use $n_u$ to indicate the number of persons making
an unattempted score of U; then

        $n_1 = y_K - y_{K-1}$
        $n_2 = y_{K-1} - y_{K-2}$
        $n_3 = y_{K-2} - y_{K-3}$
        . . .
(32)    $n_u = y_{K-u+1} - y_{K-u}$
        . . .
        $n_{K-2} = y_3 - y_2$
        $n_{K-1} = y_2 - y_1$
        $n_K = y_1.$


Many of the terms in equation 32 will be zero, since all the subjects
will presumably attempt many of the earlier items in the test.

In order to obtain $\sum U^2$ it is necessary to multiply the first frequency
by $1^2$, the second by $2^2$, and so on. The sum of the resulting products
is $\sum U^2$. Using equations 32 to write this sum of products, we obtain

(33)    $\sum U^2 = 1^2(y_K - y_{K-1}) + 2^2(y_{K-1} - y_{K-2}) + 3^2(y_{K-2} - y_{K-3}) + \cdots + u^2(y_{K-u+1} - y_{K-u}) + \cdots + (K-2)^2(y_3 - y_2) + (K-1)^2(y_2 - y_1) + K^2 y_1.$
As pointed out in connection with equations 32, when one of the
y-terms is zero, all subsequent terms are zero and may be omitted from
the summation.

Removing the parentheses in equation 33 and performing the sub-
tractions gives

(34)    $\sum U^2 = y_K(1^2) + y_{K-1}(2^2 - 1^2) + y_{K-2}(3^2 - 2^2) + \cdots + y_{K-u+1}[u^2 - (u-1)^2] + y_{K-u}[(u+1)^2 - u^2] + \cdots + y_3[(K-2)^2 - (K-3)^2] + y_2[(K-1)^2 - (K-2)^2] + y_1[K^2 - (K-1)^2].$


As before, this series is continued until all subsequent y's are zero.
Since the difference of successive squares constitutes a series of con-
secutive odd numbers, equation 34 can be written as

(35)    $\sum U^2 = 1y_K + 3y_{K-1} + 5y_{K-2} + \cdots + (2u-1)y_{K-u+1} + (2u+1)y_{K-u} + \cdots + (2K-5)y_3 + (2K-3)y_2 + (2K-1)y_1.$
The sum of this series may be written

(36)    $\sum_{i=1}^{N} U_i^2 = 2\sum_{u=0}^{K-1} u\,y_{K-u} + \sum_{u=0}^{K-1} y_{K-u}.$

The summation begins with u = 0 because the first term is $y_K$, that is,
$y_{K-u}$, where u = 0. For the sake of completeness, the summation is
indicated as extending to (K − 1), but in any computational problem
many terms will be zero and can be omitted. From equation 30 we see
that the last term of equation 36 is equal to $NM_U$. Substituting equa-
tion 36 in equation 31, we have the solution,

(37)    $s_u^2 = \frac{2}{N}\sum_{u=0}^{K-1} u\,y_{K-u} + M_U - M_U^2,$

where $s_u$ is the standard deviation of the number-unattempted score,
      $y_{K-u}$ is the number of persons not reaching the (K − u)th item,
      N is the number of persons, and
      $M_U$ is given by equation 30.

By using equations 30 and 37, $s_u$, and hence H (for use in
equations 24, 25, and 28), may be calculated directly from
item analysis data showing the number of persons not reach-
ing each item. These equations will enable us to avoid the
labor of rescoring the answer sheets in order to obtain $s_u$.
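
The computation indicated by equations 30 and 37 can be sketched as follows. The function name and the small five-item example are assumptions made only for illustration; the result is checked against the variance obtained directly from the individual number-unattempted scores implied by the same counts.

```python
from statistics import mean, pvariance

def unattempted_mean_and_variance(y, n_persons):
    """Mean and variance of the number-unattempted score from item data.

    y[g - 1] is the number of persons who did not reach item g (g = 1..K).
    """
    k = len(y)
    if any(earlier > later for earlier, later in zip(y, y[1:])):
        raise ValueError("y must not decrease from one item to the next")
    mean_u = sum(y) / n_persons                               # equation 30
    # Equation 37: sum of u * y_{K-u} for u = 0..K-1; y_{K-u} is y[k - 1 - u].
    weighted_sum = sum(u * y[k - 1 - u] for u in range(k))
    variance_u = 2 * weighted_sum / n_persons + mean_u - mean_u**2
    return mean_u, variance_u

# Hypothetical item analysis: 5 items, 10 persons.
y_counts = [0, 0, 1, 3, 6]
mean_u, variance_u = unattempted_mean_and_variance(y_counts, n_persons=10)

# The same data as individual scores (equations 32): 4 persons finished,
# 3 omitted one item, 2 omitted two items, and 1 omitted three items.
u_scores = [0] * 4 + [1] * 3 + [2] * 2 + [3]
print(mean_u, variance_u, mean(u_scores), pvariance(u_scores))
```

The square root of the variance divided by $s_x$ then gives H.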
6. Summary

In discussing the speed-power problem, the following symbols were
used:

W (wrongs), the number of items for which the subject gives an
incorrect answer,

U (unattempted), the number of items not reached at the end of the
test, and

X the total error score (X = U + W).

It is assumed that there are no skipped items.

In a pure speed test $M_w$ and $s_w$ are zero. If $s_w/s_x$ is small (0.1 or
less), the test may be regarded as primarily a speed test. In this case

(4)    $1 + \frac{s_w}{s_x} > \frac{s_u}{s_x} > 1 - \frac{s_w}{s_x},$

a lower bound for the standard deviation is indicated by

(5)    $s_u' = s_x - s_w,$

and an upper bound for the standard deviation by

(6)    $s_u'' = s_x + s_w,$

where $s_w$ is the standard deviation of the W-score,
      $s_u$ is the standard deviation of the U-score, and
      $s_x$ is the standard deviation of the X-score.


If a test is primarily a speed test, the reliability and the error of 
measurement must be estimated by means of a test-retest or an alternate 
form reliability coefficient. The reliability and the error of measurement 
so computed will correctly represent the functioning reliability of the 
test under the test directions and administrative conditions that were 
used. If the test conditions are changed in an effort to eliminate the 
W-score, it does not seem possible to make reasonable estimates regard- 
ing what will happen to the error of measurement and the reliability. 

In a pure power test, $M_u$ and $s_u$ are zero. If $s_u/s_x$ is small (0.1 or less),
the test may be regarded as primarily a power test. In this case

(7)    $1 + \frac{s_u}{s_x} > \frac{s_w}{s_x} > 1 - \frac{s_u}{s_x},$

a lower bound for the standard deviation is indicated by

(8)    $s_w' = s_x - s_u,$

and an upper bound for the standard deviation by

(9)    $s_w'' = s_x + s_u.$

For any split-half reliability that divides the U-score equally between
the two halves, it is necessarily true that

(18)    $s_x^2(1 - r_{xx}) = s_w^2(1 - r_{ww}).$

If $s^2_{meas.}$ is used to designate the error variance as obtained from an
alternate form reliability or from allowing the subjects to finish the test,

(19)    $s_x^2(1 - r_{xx}) + s_u^2 > s^2_{meas.} > s_x^2(1 - r_{xx}).$


Two different methods were suggested for estimating a lower bound 
for the reliability coefficient that would be obtained if the students were 
allowed to finish the test, or if an alternate form reliability had been 
calculated. It was found that 


(25)    $r_{xx} > R > \frac{r_{xx} - 2H}{1 - 2H},$

or that

(28)    $r_{xx} > R' > r_{xx} - H^2.$


It should be noted that both these estimates are highly tentative and 
that more experimental work on the relation between speed and power 
needs to be done before we can know which assumptions are the best 
ones to make. It seems now that R’ is better than R. 

In the four preceding equations:

$r_{xx}$ is the split-half reliability for the X-score,
$r_{ww}$ is the split-half reliability for the W-score, and
H is $s_u/s_x$.


In order to guard against the possibility of spuriously high split-half 
reliabilities being reported for partly speeded tests, it would seem desir- 
able to present routinely the coefficient H or the lower bound of formula 
28 whenever a split-half reliability is reported. 

It was also shown that the variance of the U-score could be calculated
from item analysis data showing the number of persons who did not
reach each item. If $y_g$ designates the number of persons who did
not reach item g,

(30)    $M_U = \frac{1}{N}\sum_{g=1}^{K} y_g$

and

(37)    $s_u^2 = \frac{2}{N}\sum_{u=0}^{K-1} u\,y_{K-u} + M_U - M_U^2,$

where $M_U$ is the mean U-score,
      $s_u^2$ is the variance of the U-score,
      N is the number of persons taking the test, and
      $y_{K-u}$ is the number of persons not reaching the (K − u)th item.


Problems

Data for Problems 1-3

Item     p     N_a
  1     96     500
  2     94     500
  3     90     500
  4     87     500
  5     92     500
  6     82     500
  7     84     500
  8     87     500
  9     80     500
 10     60     500
 11     68     500
 12     63     498
 13     45     497
 14     55     497
 15     50     495
 16     40     493
 17     62     490
 18     50     487
 19     65     485
 20     30     480
 21     20     470
 22     23     465
 23     25     460
 24     40     450
 25     30     441
 26     22     432
 27     26     417
 28     36     406
 29     48     393
 30     21     372

M = 15.9; s = 8.3; r = .97 (corrected odd-even correlation).

N_a = number of persons who attempted, that is, indicated some answer (right or
wrong) for each item.
p = percentage of those attempting the item who answered it correctly.
1. Calculate the standard deviation of the number-unattempted score from the
item analysis data given.


2. Using the data given, plot the frequency distribution of the number-unattempted
score, calculate the mean and standard deviation of this distribution to verify the
calculation in problem 1.


3. How seriously might the reliability of this test be affected by the speeded
nature of the test score?


Data for Problems 4-6

         Number        Gross Score
Test    of Items    Mean    Standard Deviation     R      s_u
 A        180       73.6          27.4            .97     2.1
 B        150       93.7          16.3            .93     7.6
 C         90       55.1          14.5            .95     6.3
 D         70       30.2           8.4            .85     0.5
 E        100       53.6          11.2            .82     5.4

R = 2r/(1 + r), where r = the odd-even correlation.


4. Give a lower bound for the reliability coefficient of each of the tests A to E.


5. Give the error of measurement for each test, and also an upper bound for this
error.

6. (a) For which tests is an odd-even reliability justified? 

(b) Which tests require an alternate form or a test-retest reliability? 




18 


Methods of Scoring Tests 


1. Introduction

In this chapter we shall consider two basically different types of 
scoring problems. One type includes the problems in scoring tests where 
each item has one or possibly more answers that are correct (hence are 
scored one point) and other answers that are incorrect (hence receive 
zero credit). The other type of test question is the one for which there 
is no generally “correct” answer. Items used in attitude, interest, or 
personality schedules are of this type, and they present special scoring 
problems.

Only the simpler methods of scoring tests, based on time or on item 
count, will be considered here. Scoring methods that attempt to deter- 
mine “level reached,” such as used in the Binet test, demand a different 
type of theoretical approach, and will not be considered here. The more 
precise absolute scaling methods presented by Thurstone (1925 and 
1927b) also require a different theoretical approach, and are beyond the 
scope of this book. 

For purposes of this discussion, we shall consider that the items of a 
test can be divided into four categories, designated as follows: 


R (rights), the number of items marked correctly, 

W (wrongs), the number of items marked incorrectly, 

S (skips), the number of items that have not been marked, but are 
followed by items that have been answered (R or W). It looks 
as if the subject attempted to work the item, and then decided 
to skip it and move on to a later item.

U (unattempted), the number of consecutive items at the end of the 
test that are not marked. It looks as if the subject did not have 
a chance to attempt these items before time was called. 


There is a possibility that the number of items skipped (S) or the
number at the end of the test that are unattempted (U) would be useful
scores. Such scores, coupled with careful test directions, may indicate
“cautiousness” or some other similar personality characteristic of the
subjects. Some subjects may show a consistent tendency to mark items 
and to get them wrong; others may hesitate and skip items, hence have 
a much larger S or U score. No one seems to have investigated such 
possibilities. 


2. Number of correct answers usually a good score 


The way tests are usually handled at present is to frame the directions 
to emphasize that the subjects should answer the items consecutively. 
This means that the number skipped (S) will be zero or negligibly small. 
In a power test an effort is made to allow sufficient time for nearly all 
the questions to be answered by nearly everyone. This means that the 
number of items unattempted (U) will be small, and the score can be 
regarded as depending primarily on the number of items marked in- 
correctly (W). In a speed test an effort is made to have no items 
answered incorrectly (W = 0). The score in this case can be regarded
as depending primarily on the number of items unattempted (U). 

If a test is primarily a power test, that is, if S and U are each negligible, 
the score may be the number marked correctly (R) or its complement 
(W), the number marked incorrectly. If a test is primarily a speed test, 
as is the case if W and S are each negligible, the score should be the 
number marked correctly (R) or its complement (U), the number of 
items unattempted. We shall now consider the cases in which S or U 
(the number of unmarked items) is not negligible for a test that is 
designed as a power test, and in which S or W, the number of items
marked incorrectly or skipped, is not negligible for a test that is designed
as a speed test.


3. The problem of guessing in a power test 


Under ordinary examining conditions, even if S and U, the number
of unmarked items, are fairly large, the number of items marked cor-
rectly (R) will turn out to be a suitable score for the examination. This
will be the case if each student reads each item and honestly tries to
solve the problem before marking an answer. In general, the student
who knows the material will solve the problems correctly and more
quickly; hence he will have more correctly marked answers than the
student who does not know the material. However, the test constructor
and the test scorer must bear in mind that it is possible for a student
who does not know the answer to an item to mark it correctly by chance
in an objective examination. If practically all items are marked by
each of the students, this effect is not a serious one and can be ignored.
However, sometimes a student may observe that he has only two minutes
left and may feel that it is good policy to mark quickly the last twenty 
or thirty items that he does not have time to read in order to get the 
benefit of a chance score. If the score is taken as number marked cor- 
rectly, this student is likely to add more to his score in the last two 
minutes than another equally good student who spent the last two 
minutes attempting to solve one item. 

It should be possible to detect such cases by plotting the number of 
the last item attempted as the abscissa, against R, the number marked 
correctly as the ordinate. On such a plot, the line y = x would indicate 
the locus of scores that were perfect as far as the items were marked, and 
the line y = (1/A)x, where A is the number of alternatives for each
item, would indicate the locus of the average chance score. For example,
if the test is composed of five-choice items, the average score from pure
guessing would be one-fifth of the items correct, and the line y = (1/5)x
would be the locus of such scores. If some points with a relatively high
R, number of correct answers, are near this line, they show that a rela- 
tively good R score is made by some persons who are apparently guessing 
the answers to a large number of items. 

A more accurate plot to indicate the presence of good scores made by 
guessing would be to plot the number correct (R) as the ordinate against
the number attempted (R + W) as the abscissa. In the plot previously 
mentioned, the number of the last item attempted is equal to 
(R+ W +S). In the new plot the points would be moved to the left, 
and therefore away from the chance line. That is, if the first plot of R 
against number of last item attempted shows no points near the chance 
line, there are no scores that are chance scores. If the first plot shows 
points near the chance line, it may be desirable to make the second plot, 
which is more time consuming, in order to see if we still have a clear
indication that good R scores can be made near the level of an average 
chance score. 
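
The comparison with the chance line can also be made arithmetically rather than graphically. The sketch below is an illustration only: the function names, the records, and in particular the tolerance of three points are assumptions, since the text does not say how near the chance line a score must lie to be suspect.

```python
def chance_excess(rights, attempted, alternatives):
    """Number right minus the average chance score for the items attempted."""
    return rights - attempted / alternatives

def flag_near_chance(records, alternatives, tolerance):
    """Names of subjects whose R is within `tolerance` of the chance line.

    records holds (name, rights, attempted) triples; attempted is R + W.
    """
    return [
        name
        for name, rights, attempted in records
        if chance_excess(rights, attempted, alternatives) <= tolerance
    ]

# Hypothetical five-choice test.
records = [("a", 40, 50), ("b", 14, 60), ("c", 25, 30)]
flagged = flag_near_chance(records, alternatives=5, tolerance=3)
print(flagged)
```

A flagged subject with a relatively high R corresponds to a point near the line y = (1/A)x in the second plot described above.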

If we have a test in which some persons are making high R scores 
on the basis of a chance ratio between number right and number at- 
tempted, the situation is unsatisfactory, and steps must be taken to 
alter it. If the test is a trial run, it may be possible to shorten the test 
by eliminating some of the items, so that more people will finish the 
test, or it may be possible to lengthen the time allowed for the test so 
that more persons can finish. If either of these changes can be made, 
we may still retain the simple number right score. This score has the 
advantage of being quick to obtain, and of allowing relatively little 
opportunity for clerical errors. However, if the test scores must be 
used as is, or if it is not possible for other reasons to shorten the test 
or lengthen the time, it is possible to consider more complicated scoring 
formulas that attempt to take account of some of the possible effects of
guessing. It must be emphasized again that there is no reason for con- 
sidering any of these formulas if, for most of the people, R + W is 
essentially equal to the total number of items in the test. Such formulas 
are to be used if, and only if, the number of unmarked items (S + U) 
is fairly large for some persons, and fairly small for others. 

Let B = the number of items left blank. This includes those left 
blank because they were skipped and those not attempted at the end 
of the test. That is, 

$B = S + U.$

Using K to designate the total number of items in the test, we have

$K = R + W + B.$


One method of dealing with the problem of variation in amount of
“guessing” from one person to another is to assume that, if there are A
alternative choices for each item, then if each person had answered every
item he would have answered 1/A-th of them correctly by chance. Let
$X_B$ designate the score (number right) that would probably have been
made if every item had been attempted; then

(1)    $X_B = R + \frac{1}{A}B.$


It should be noted that, if any of the items in an examination are so
difficult and have such plausible distractors that less than 1/A-th of
the persons attempting the item get it correct, equation 1 cannot be
used. Using it would have the peculiar result that persons not attempt-
ing an item would get a higher score than those who thought about the
item and answered it. Items of such a high level of difficulty should
not be used unless there is some special reason that demands their use.
For example, a test of “common fallacies” or “popular superstitions”
would necessarily contain items that often might be answered correctly
by less than the expected chance percentage of those attempting the
item. In such a case, however, it is necessary to allow time for every
person to answer each of the items so that no correction for effects of guessing
will be necessary.

Instead of estimating how many items would have been marked
correctly if all items had been marked, it is also plausible to approach
this problem of correction for guessing by attempting to estimate the
number of items for which the person knew the correct answer. In this
approach it is assumed that the items left blank are not known so that
nothing need be added for them. It is also assumed that of the items
that the person guessed, 1/A-th were (by chance) answered correctly
and are included in the group of items (R) answered correctly. The
remaining fraction of the items answered by guessing, (A − 1)/A,
represents the items answered incorrectly or the group of items pre-
viously designated W. It follows then that W/(A − 1) is equal to the
number of items in the R group that were lucky guesses. This number
should be subtracted from the number answered correctly to give an
estimate of the number of items for which the answer is known. Let
$X_W$ designate the number of items for which the answer is known; then

(2)    $X_W = R - \frac{W}{A - 1}.$


Again it should be noted that, like equation 1, equation 2 cannot be 
used when items are so difficult that less than a chance proportion of 
those attempting the item get it correct. Hamilton (1950) has utilized 
the regression line to give a more accurate treatment of the problem 
of chance success. 
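
As a minimal sketch of the two corrections discussed above, the following Python functions compute the score of equation 1 (rights plus 1/A-th of the blank items) and the score of equation 2 (rights minus W/(A - 1)). The function names and the sample paper are illustrative assumptions rather than anything given in the text, and exact fractions are used only to avoid rounding.

```python
from fractions import Fraction


def score_if_all_attempted(rights, blanks, alternatives):
    """Equation 1: number right plus 1/A-th of the blank items."""
    return Fraction(rights) + Fraction(blanks, alternatives)


def score_items_known(rights, wrongs, alternatives):
    """Equation 2: number right minus W/(A - 1)."""
    return Fraction(rights) - Fraction(wrongs, alternatives - 1)


# Hypothetical paper: 50 five-alternative items, 30 right, 8 wrong, 12 blank.
eq1 = score_if_all_attempted(rights=30, blanks=12, alternatives=5)
eq2 = score_items_known(rights=30, wrongs=8, alternatives=5)
print(eq1, eq2)
```

For this hypothetical paper equation 1 gives the larger value, as it does for every paper that is not perfect, since the two scores differ by B/A + W/(A - 1).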

Equation 1 will always give higher numerical scores than equation 2, 
except for persons making a perfect score. From the viewpoint of 
checking against norms, for example, the two equations are not inter- 
changeable. However, from the viewpoint of ranking the students or 
of making correlational studies, the two scores will give exactly the same 
results, since they are perfectly correlated. To show this, we shall write 
the functional relationship between $X_B$ and $X_W$. 

Since K = R + W + B, we may write 

(3) $B = K - W - R$.

Substitute this value in equation 1 and rearrange terms, obtaining 

(4) $X_B = \frac{A - 1}{A} R - \frac{1}{A} W + \frac{K}{A}$.

If we multiply both sides by A/(A - 1) and subtract the constant 
K/(A - 1), we have 

(5) $\frac{A}{A - 1} X_B - \frac{K}{A - 1} = R - \frac{W}{A - 1}$.

Since the right side of equation 5 is identical with equation 2, we have 
expressed $X_W$ as a linear function of $X_B$.
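
The linear relation can also be checked numerically. The sketch below, whose helper names are assumptions made for illustration, enumerates every possible split of K items into rights, wrongs, and blanks and confirms that multiplying the equation 1 score by A/(A - 1) and subtracting K/(A - 1) reproduces the equation 2 score exactly.

```python
from fractions import Fraction


def eq1_score(rights, blanks, alternatives):
    # Rights plus 1/A-th of the blanks.
    return Fraction(rights) + Fraction(blanks, alternatives)


def eq2_score(rights, wrongs, alternatives):
    # Rights minus W/(A - 1).
    return Fraction(rights) - Fraction(wrongs, alternatives - 1)


def linear_relation_holds(num_items, alternatives):
    """Check the linear relation for every (R, W, B) with R + W + B = K."""
    scale = Fraction(alternatives, alternatives - 1)
    shift = Fraction(num_items, alternatives - 1)
    for rights in range(num_items + 1):
        for wrongs in range(num_items - rights + 1):
            blanks = num_items - rights - wrongs
            lhs = scale * eq1_score(rights, blanks, alternatives) - shift
            if lhs != eq2_score(rights, wrongs, alternatives):
                return False
    return True


print(linear_relation_holds(num_items=20, alternatives=5))
```

Because the relation is linear with a positive slope, the two scores rank the students identically, which is why they are interchangeable for correlational work but not for comparison against norms.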

There is another method of dealing with the problem of correcting 
scores on a primarily power test for the effects of possible guessing. The 
method to be proposed guards against the practice of quickly answering 
all unfinished items just before time is called. The suggestion is to use 
the score 


(6) $X_U = R + \frac{U}{A}$.


To the number right (R), add 1/A-th of the number of items at the end 
of the test that were unattempted by the subject. This differs from 
equation 1 in that no partial credit is given for skipping an item. If a 
subject studies an item, it is desirable to encourage him to give his most 
considered response to that item. Under equation 6 the subject would 
have everything to gain and nothing to lose by marking each item that 
he had time to study. However, there would be no point to rushing 
through during the last minute of the examination and marking all 
remaining items since he would get credit for a chance proportion of 
them anyway. Even the last minute of the examination would, under 
such a system, best be spent in attempting to give a correct answer to 
one more item. 
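
A rough sketch of how equation 6 might be applied to a single answer sheet follows. The representation of a paper as a list of marks with `None` for an unmarked item, and the function names, are assumptions for illustration only. The point of the sketch is the distinction the formula relies on: unmarked items at the very end of the paper count as unattempted (U), while unmarked items followed by a marked item count as skips (S) and earn no credit.

```python
from fractions import Fraction


def tally(responses, key):
    """Return (R, W, S, U) for one paper; None marks an unmarked item."""
    # U: the unbroken run of unmarked items at the end of the paper.
    unattempted = 0
    for mark in reversed(responses):
        if mark is not None:
            break
        unattempted += 1
    reached = responses[:len(responses) - unattempted]
    rights = sum(1 for mark, correct in zip(reached, key) if mark == correct)
    skips = sum(1 for mark in reached if mark is None)
    wrongs = len(reached) - rights - skips
    return rights, wrongs, skips, unattempted


def eq6_score(responses, key, alternatives):
    """Number right plus 1/A-th of the unattempted items at the end."""
    rights, _, _, unattempted = tally(responses, key)
    return Fraction(rights) + Fraction(unattempted, alternatives)


# Hypothetical ten-item, five-alternative test.
key = list("ABCDEABCDE")
paper = ["A", "B", None, "D", "A", "A", None, None, None, None]
print(tally(paper, key), eq6_score(paper, key, alternatives=5))
```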

As far as the writer knows, equation 6 has not been suggested pre- 
viously or studied, especially with respect to its effect on the attitude of 
students taking an examination. Perhaps it would avoid some of the 
undesirable examination attitudes that are sometimes engendered by 
objective examinations. It must again be stressed that equations 1, 2, 
and 6 are suggested only when it is not feasible to allow the students to 
finish the examination. The best policy is to insure that practically 
all items are attempted by practically all the students, and then simply 
score number right (R). 

If we depend on IBM machine scoring of tests, the possibility that a 
student will mark several answers to one item must be considered. 
Multiple marks on a single item may occur either because the student 
has misunderstood the directions or because of a belief that the “machine 
will just sense the correct marks.” By ordinary scanning procedures 
it is difficult to be sure of detecting all multiple marking. An easy 
method of dealing with this possibility, and also with the possibility 
that some students will mark items without reading them in order to 
finish the test, is to score the test rights minus an appropriate fraction 
of the wrongs. For hand scoring, this equation is considerably more 
labor than the number right score. For machine scoring, the papers 
are scored in the same time regardless of the scoring formula. For the 
rights minus wrongs scoring, it takes a little longer to make the 
preliminary adjustments on the machine. 

When marking papers by hand, scoring solely on the basis of number 
correct is usually perfectly satisfactory and considerably more rapid and 
accurate than using any of the foregoing scoring formulas. However, 
if the test has a short time limit so that many persons do not finish it, 
the scorers must note, and call to the attention of the supervisor, any 
cases in which an unusually large number of items have been answered 
and an unusually large number of errors occur toward the end of the 
test paper. If a moderately high score is made in this way it may be 
desirable to rescore the papers using one of the scoring formulas given 
in this section. 


4. The problem of careless errors in a speed test 


In a speed test none of the equations given in the preceding section is 
appropriate. Giving credit for 1/A-th of the unfinished items (equations 
1 or 6) is inappropriate because the score in a speed test should represent 
the number of items the student is able to do in the allotted time. 
Deducting for items answered correctly by chance (equation 2) is 
inappropriate because in a properly constructed speed test the items should 
not be difficult. Each student should be able to answer each item if he 
studies the item. Thus there is no problem of estimating how many 
items the student knows, as distinct from how many lucky guesses he 
made. The problem is simply, how many items can be solved in the 
allotted time? If time were increased sufficiently each student would 
receive a perfect score. If a speed test is properly constructed, and if 
the students respond properly, the number of skips (S) and the number 
answered incorrectly (W) will be zero. The test can be scored either 
in terms of number right (R) or number unattempted at the end of the 
test (U). 

However, if a test is designed as a relatively pure speed test, and we 
observe papers in which all the items are marked and the number of 
errors near the end of the paper is much greater than the number near 
the beginning, it may be well to suspect that those students are answering 
the items without studying them in order to capitalize on a possible 
chance score. It is then necessary to rescore the papers using some sort 
of penalty for items marked incorrectly. 

In order to motivate the students to answer each item correctly (not 
to mark items carelessly), and not to skip items, it is desirable to stress 
both these points in the instructions. It may also be well to have a 
small penalty for skips and a larger penalty for errors. This formula 
would be 

(7) $\text{Score} = R - \frac{W}{C} - \frac{S}{D}$,


where C and D are arbitrary constants, C < D. In order to motivate 
the students properly, it perhaps is appropriate to make D slightly 
larger than the number of alternatives (A), and C slightly smaller than 
A — 1. For example, in a five-alternative multiple-choice test the 
formula might be chosen as 


Perhaps such a device would encourage the student to mark the item, 
but to be careful to mark it correctly. 
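
A minimal sketch of this penalty scoring is given below. It takes the score as rights minus W/C minus S/D. The constants C = 3 and D = 6 used in the example are purely illustrative assumptions for a five-alternative test, chosen only so that C is slightly smaller than A - 1 and D slightly larger than A; they are not values taken from the text.

```python
from fractions import Fraction


def speed_test_score(rights, wrongs, skips, c, d):
    """Rights minus W/C minus S/D, with C < D so errors cost more than skips."""
    if not c < d:
        raise ValueError("expected C < D so that errors are penalized more than skips")
    return Fraction(rights) - Fraction(wrongs, c) - Fraction(skips, d)


# Hypothetical paper on a five-alternative speed test (A = 5).
# C = 3 and D = 6 are assumed values, not constants given in the text.
score = speed_test_score(rights=40, wrongs=3, skips=6, c=3, d=6)
print(score)
```

With these assumed constants each error costs one third of a point and each skip one sixth, so an error is twice as costly as a skip.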

If the penalty for errors or omissions in a speed test is to be used, it 
is probably desirable to study the effect of different penalties on the 
performance of the students. For example, the penalty for errors in a 
typing test is arbitrarily set at ten words per error. What would be 
the effect on student performance if he knew that the penalty would be 
twenty words, or if he knew that it would be five words per error? 

It should also be noted that, if we have a criterion to predict, it is 
unnecessary to bother with these arbitrary scoring formulas. Multiple 
correlation methods will give the best weights to use. 

It is important to adjust the instructions and the motivation of the 
students so that all items are answered, and are answered honestly, after 
some study and thought by the student. If such an attitude is secured 
from all students, then either number right (R), number wrong (W), or 
number not attempted (U) could readily be used as the score without 
troubling about any scoring formula. It would be desirable to choose 
the one of the three that had the largest variance for that particular 
test as the final score. Every effort should be made to design the 
examination, the instructions, and the motivation of the students to 
discourage the use of various irrelevant tricks that are frequently applied 
in connection with objective examinations. For example, students often 
inquire if there is a “penalty for guessing.” If the answer is “no,” they 
will mark a great many items without bothering to read them if time 
seems short, with the expectation that some will be correct by chance; 
if the answer is “yes,” they will skip items rather than imperil their 
score by guessing. Either attitude is to be avoided since both introduce 
considerations that are probably irrelevant to the student's knowledge 
and understanding of the field, and these are the things that should be 
measured by the examination. 


5. Time scores for a speed test 


Sometimes the time taken to perform a standard task is the score 
assigned to a test. The larger the score, the poorer the performance. 
In this respect the time score has one property of an error score. It 
should be noted that in general a time score is not especially suitable 
for group testing. When testing individuals or small groups of three to 
five, the examiner can easily hold a stop watch and mark the time 
when each person finishes. If we wish to secure more uniform timing 
and are satisfied with relatively coarse groupings, such as half-minute 
or one-minute groupings, it is possible to have a single time-keeper for a 
large group of proctors, each of whom is responsible for a small group 
of students. The time-keeper displays a card with a number on it or 
writes a number on the blackboard at stated intervals. This digit 
indicates the number of minutes or half minutes since the test started, 
or the number until the conclusion of the test (if we wish to have a score 
such that higher numbers indicate better performance). The proctor 
then writes this number on the student’s answer blank when the student 
has finished the task. If we are willing to rely on the students, it is 
possible to have the student write the number on his own answer blank. 
It is probable that this method should not be used if only one time limit 
is being taken, since it would be relatively simple for the student to 
write a different number from the one that was actually being shown. 
However, if the test is long and keeps the students working for the entire 
time, it is probably all right to have the student indicate the time of 
finishing each of a number of subsections, if such a time score is desired. 

It should also be noted that time scores could readily have many of 
the properties of number-correct scores. For example, doubling the 
test would give two time scores for each person, and the total score on 
the test would be the sum of the two time scores. If the means, vari- 
ances, and covariances satisfied the criterion for parallel tests (see 
Chapter 14), the theorems regarding effect of increased length would 
hold. In applying the theorems previously established to time scores, 
it is essential to see that differential fatigue is not a serious factor. For 
example, the time taken to run a hundred-yard dash is a perfectly good 
score for the hundred-yard dash “test.” It does not follow that the test 
becomes more reliable as it is lengthened. We cannot use four one- 
hundred-yard dashes in succession and then perhaps decide to use a 
five-hundred-yard dash as our final test in order to secure adequate 
reliability. The same consideration applies in lesser degree to any test. 
To a considerable extent the nature of a fifteen-minute test cannot be 
the same as a six-hour test. There are added factors of fatigue, etc., 
entering in; and we usually find six-hour tests divided into two three-hour 
sessions. In other words, when equations on test length are used for 
timed tests, the same precautions previously mentioned apply. The 
new “unit” tests must all be “parallel” to one another. This means 
that the test average, the standard deviation, and all intercorrelations 
must be the same. If this is not found to be true as we lengthen either a 
power or a speed test, the equations relating test length to other param- 
eters no longer hold. 


6. Weighting of time and error scores 


Sometimes the question of weighting time and error scores to get a 
single composite score is raised. Again, in general, the best thing to do 
is to have a criterion and to use the weights that best enable us to predict 
the criterion. The multiple correlation approach is the best one for 
the problem of weighting when an outside criterion is available. 

[Figure 1: three panels, (a), (b), and (c), each plotting Time against Errors.] 

Figure 1. Illustrating different weightings of time and errors: (a) Equal weights 
for time and errors. (b) Errors receive twice as much weight as time. (c) Time 
receives twice as much weight as errors. 

Often, however, no outside criterion is available. Then the only 
recourse is to fall back upon judgment. A detailed technical method 
for securing and dealing with such judgments is given by Thurstone 
(1931) in his article “The Indifference Function." If a sufficient 
amount of time and number of judges are available, this method should 
be used. However, a very crude approach can also be used that utilizes 
a correlation scatter plot of time against errors. This scatter plot may 
be a plot of actual cases or simply an imaginary one where the instructor 
is asked to suppose that cases of various types occurred. 

The instructor in the course, or a group of instructors, or some other 
authority is then asked to judge “which is better” for various pairs of 
students. In this way we can rapidly and crudely determine a family 
of lines that will divide the scatter plot into appropriate zones of increas- 
ing ability. Usually these zones can be approximated closely enough by 
a series of parallel straight lines. This fact indicates that a linear com- 
bination of time and errors is adequate. The relative weights of the 
two factors are proportional to the slopes of the lines. For Figure 1a 
time and errors are equally weighted, for Figure 1b errors have twice 
the weight of time, and for Figure 1c time has double the weight of 
errors. 

A similar graphic system can be used for determining rapidly the 
opinion of an expert judge in the field regarding the appropriate weight- 
ing of any two subtests in a composite score. 


7. Weighting with a criterion available 

When a definite criterion score is available, we should always use the 
multiple correlation to determine the relative weighting of time and 
errors, of rights and wrongs, or of rights, wrongs, and skips, or any other 
set of subscores that can be obtained from a test. 

If the test score is used to predict a definite criterion, the scoring 
method should be based on multiple-correlation methods to secure 
maximum prediction of the criterion. In principle it is possible to 
determine a separate weight for each item in the test, and to do this in 
such a way as to maximize the correlation of total test score with the 
criterion. In practice, however, this procedure is not usually followed, 
partly because of the very great amount of calculation involved 
and partly because the individual item weights are likely to be very 
unstable unless based on large numbers of cases. For example, Guttman 
reports a study in which a sample was divided into two random halves; 
the first half was scored on the basis of multiple-correlation weights 
assigned to each item, with a resulting multiple correlation of .73. 
When these same weights were used on the second sample the correlation 
was .04. (See Horst, 1941, page 360.) 

It is often, however, both feasible and desirable to determine the 
multiple correlation and corresponding weights for a few subscores. 
For example, instead of being weighted on the basis of average chance 
success, the wrongs can be weighted to secure maximum prediction of 
the criterion. This method was given by Thurstone (1919). If the 
rights are to be weighted unity, the best formula is¹ 

(8) $X_{W'} = R + \frac{s_R}{s_W} \cdot \frac{r_{YW} - r_{YR} r_{RW}}{r_{YR} - r_{YW} r_{RW}} W$,

where $s_R$ is the standard deviation of the rights score, 
$s_W$ is the standard deviation of the wrongs score, 
$r_{RW}$ is the correlation between “number right” and “number wrong,” 
$r_{YR}$ is the validity coefficient for number right, and 
$r_{YW}$ is the validity coefficient for number wrong. 


When the weights of equation 8 are used, the correlation between 
$X_{W'}$ and Y will be the multiple correlation given by 


(9) $R_{Y X_{W'}} = \sqrt{\frac{r_{YR}^2 + r_{YW}^2 - 2 r_{YR} r_{YW} r_{RW}}{1 - r_{RW}^2}}$.


The same type of weighting scheme can be used for any two variables 
that are being used to predict a third. For example, if Y is the criterion, 
W is the number of errors, and T is the time score, the best weighting 
of errors in relation to time will also be given by formula 8 if T is substi- 
tuted for R. Formula 9 also gives the multiple correlation of this 
weighted time and errors score with the criterion, if correlations involving 
time are substituted for those involving number right in the formula. 
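
The weighting just described can be illustrated with a short sketch. It assumes the standard two-predictor least-squares result: with rights weighted unity, the weight for wrongs is (s_R/s_W)(r_YW - r_YR r_RW)/(r_YR - r_YW r_RW), and the squared multiple correlation is (r_YR² + r_YW² - 2 r_YR r_YW r_RW)/(1 - r_RW²). The function names and the six invented cases are assumptions for illustration; the data are not from any study.

```python
import math


def mean(values):
    return sum(values) / len(values)


def sd(values):
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def corr(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (sd(xs) * sd(ys))


def wrongs_weight(rights, wrongs, criterion):
    """Weight for W when R is weighted unity (two-predictor regression)."""
    r_yr, r_yw, r_rw = corr(criterion, rights), corr(criterion, wrongs), corr(rights, wrongs)
    return (sd(rights) / sd(wrongs)) * (r_yw - r_yr * r_rw) / (r_yr - r_yw * r_rw)


def multiple_correlation(rights, wrongs, criterion):
    r_yr, r_yw, r_rw = corr(criterion, rights), corr(criterion, wrongs), corr(rights, wrongs)
    return math.sqrt((r_yr ** 2 + r_yw ** 2 - 2 * r_yr * r_yw * r_rw) / (1 - r_rw ** 2))


# Invented data for six hypothetical examinees.
rights = [30, 25, 40, 35, 20, 45]
wrongs = [10, 5, 8, 12, 15, 3]
criterion = [60, 55, 80, 65, 40, 90]

weight = wrongs_weight(rights, wrongs, criterion)
composite = [r + weight * w for r, w in zip(rights, wrongs)]
print(weight, corr(composite, criterion), multiple_correlation(rights, wrongs, criterion))
```

In this invented sample the weight for wrongs comes out negative, so errors are penalized, and the correlation of the weighted composite with the criterion agrees with the multiple correlation computed from the three correlations. The same functions apply to time and errors if a time score is passed in place of rights.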


8. Scoring items that have no correct answer 

The items in tests of personality characteristics, attitudes, and interests 
frequently do not have clear-cut “correct” and “incorrect” answers. 
Then it is necessary to have a criterion that we wish to predict in order 
to set up the scoring key for the test. The simplest scoring key is one 
in which each alternative answer is to be scored either 1 or 0. If there 
are only two alternatives, say A and B, we obtain the average criterion 
score for those choosing the A alternative, and the average criterion score 
for those choosing the B alternative, and assign the score 1 to the 
alternative having the higher and zero to the alternative having the lower 
criterion score. If there are many items, and we desire to eliminate 
some of them, a measure of the significance of the difference, such as 
$(M_A - M_B)/s_{A-B}$, may be obtained, and items with a low value for 
this critical ratio may be discarded. 

¹ Formulas 8 and 9 are derived in Chapter 20. See equations 56 and 58 of 
Chapter 20. 

If there are several alternative responses for an item, it may still be 
well to stick to a simple 1 or 0 scoring key. Then the procedure would 
be to compute the mean criterion score for each of the alternatives, to 
arrange them in order of magnitude, and to observe where the greatest 
difference occurred between successive means. The ones above this 
dividing point are scored 1, and those below are scored 0. A somewhat 
more elaborate procedure would be to use a measure of the significance 
of the difference for each of the possible cutting points, and to choose 
that which gave the largest critical ratio. 
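
The simpler of these two procedures, cutting at the greatest difference between successive means, can be sketched as follows. The function name, the tie-breaking rule (the first of several equal gaps is used), and the mean criterion scores in the example are all assumptions made for illustration.

```python
def zero_one_key(mean_criterion):
    """Build a 1/0 key from the mean criterion score of those choosing each alternative.

    Alternatives are ordered by mean criterion score, the largest gap between
    successive means is located, and alternatives above the gap are keyed 1.
    """
    ordered = sorted(mean_criterion, key=mean_criterion.get)
    gaps = [
        mean_criterion[ordered[i + 1]] - mean_criterion[ordered[i]]
        for i in range(len(ordered) - 1)
    ]
    cut = gaps.index(max(gaps))  # assumed rule: first largest gap wins
    return {alt: (1 if position > cut else 0) for position, alt in enumerate(ordered)}


# Invented mean criterion scores for a four-alternative item.
means = {"A": 52.0, "B": 47.5, "C": 61.0, "D": 49.0}
print(zero_one_key(means))
```

Here the ordered means are 47.5, 49.0, 52.0, and 61.0, the largest gap falls below the last of these, and only alternative C is keyed 1.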

If a very large number of cases are available for standardization, so 
that we can have confidence in the stability of the results, it may be 
reasonable to consider a more complex scoring key that would assign a 
different weight to each possible alternative. A procedure is given by 
Guttman (see Horst, 1941, page 341). In order to maximize the correla- 
tion between the score and the criterion, it is necessary to obtain the 
mean criterion score for those selecting each alternative, and then to 
assign weights such that the differences in weights are proportional to 
the differences in mean criterion scores. A simple method of doing this 
is to assign the value 0 to the alternative with the lowest mean criterion 
score, and then to subtract this mean from each of the others, and assign 
a rounded fraction of this difference in means as the weight for the alter- 
native. For example, in weighting it is probably desirable to limit the 
weights to the integers from 0 to 5, or from 0 to 10. Wilks (1938) has 
shown that under certain fairly general sets of assumptions, the correla- 
tion between one linear composite and another composite using different 
weights will differ from unity by about 1/n, where n is the number of 
different elements entering into each weighted composite (see equation 
47, Chapter 20). That is, elaborate weighting systems with fractional 
or negative weights probably should be avoided. The use of 0 and 1 
or of 0, 1, and 2 is enough for most situations. 
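
A small sketch of the rounded-weight procedure follows. The text asks only that the lowest alternative receive 0 and that the others receive a rounded fraction of their difference in mean criterion score; scaling so that the highest alternative receives the maximum weight is an assumption made here to keep the weights within 0 to 5. The function name and the example means are likewise hypothetical.

```python
def integer_weights(mean_criterion, max_weight=5):
    """Integer weights roughly proportional to differences in mean criterion score.

    The alternative with the lowest mean gets 0. Scaling the largest difference
    to max_weight is an assumption of this sketch, not a rule from the text.
    """
    lowest = min(mean_criterion.values())
    spread = max(mean_criterion.values()) - lowest
    if spread == 0:
        return {alt: 0 for alt in mean_criterion}
    return {
        alt: round(max_weight * (m - lowest) / spread)
        for alt, m in mean_criterion.items()
    }


# Invented mean criterion scores for a four-alternative item.
means_by_alternative = {"A": 52.0, "B": 47.5, "C": 61.0, "D": 49.0}
print(integer_weights(means_by_alternative))
```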

It is also possible to score a five-alternative multiple choice item by 
assigning a different weight to each alternative. The usual procedure 
is to select one of the alternatives as correct, and to score any one of the 
other four zero. Sometimes item analysis data indicate clearly that the 
persons selecting one wrong answer are much better or poorer than those 
selecting another wrong answer. If it is possible to standardize a test 
on five or ten thousand cases, it might be worth while to consider the 
possibility of differential weighting for each alternative. The most 
common plan at present is to use such detailed item analysis data only 
for the purpose of discarding the poorer distractors, since it is felt that 
scoring items either 1 or 0 is highly desirable. 


9. Scoring of rank-order items 

In testing for certain types of knowledge it is frequently convenient 
to require the student to arrange the alternative answers in order with 
respect to some characteristic. If we wish to test for knowledge of 
chronology without the use of dates, a series of three to ten events can 
be listed and the student required to number them in order from the 
earliest to the latest. In testing for the student's appreciation of a given 
philosophical or political viewpoint, it is possible to present three to 
five arguments for a given line of action, and require the student to 
mark 1 for the argument most likely to be used by a socialist (for 
example), 2 for the argument next most likely to be used, and so on to 
5 for the argument that is least likely to be used by a person with 
such a viewpoint. In order to test for very fine discrimination in 
any field, it is possible to ask a question, and then present the student 
with three to five answers. The task is for the student to grade these 
answers, just as if he were the instructor in the course, by ranking them 
in order from best to worst. In all cases like the foregoing, we have a 
problem of how to grade rank-order items. 

To simply prepare a key giving the correct order and then give one 
point for each agreement between the key and the student's ranking is 
clearly a poor method. For example, if the correct order is 1, 2, 3, 4, 
the answer 2, 1, 4, 3 shows zero agreement with the key, and so does the 
answer 4, 3, 2, 1. Yet the first is clearly better than the second. One 
easy method for scoring such items is to insist that every item be correct 
in order for credit to be given; any error regardless of how many is given 
zero credit. Usually the subject matter expert deems such a method 
unsatisfactory. The person who makes only one inversion (hence has 
two disagreements with the key) is clearly better than the person who 
has things mixed up all along the line. A better method is to secure the 
differences between the rank order given by the subject and the rank 
order given by the key. For an elaborate scoring procedure we should 
square these differences, sum them, and compute the rank correlation 
by the formula 


(10) $\rho = 1 - \frac{6 \sum d^2}{n^3 - n}$,


where $\sum d^2$ is the sum of the squares of rank differences and n is the 
number of items ranked. However, the computation of a correlation 
coefficient for each such item on each paper introduces both considerable 
labor and considerable probability of error. 

A simple and satisfactory method for scoring such items is to use the 
sum of the absolute differences. If the rest of the examination is scored 
in number of errors, this sum can be added directly in with the errors. 
If the rest of the examination is scored in terms of number correct, this 
sum can be subtracted from some constant to give zero disagreement 
the highest score and great disagreement the lowest score. The formula 
would be 


(11) $\text{Score} = C - \sum |d|$,


where $\sum |d|$ is the sum of differences ignoring sign, and 
C is a constant larger than the greatest $\sum |d|$ we are likely 
to find. (If in scoring we find a few negative differences, 
these may be counted as zero.) 
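
The two scoring methods for rank-order items can be compared on the example used earlier in this section, where the key is 1, 2, 3, 4. The sketch below computes the rank correlation, 1 minus six times the sum of squared rank differences over n³ - n, and the simpler score, a constant minus the sum of absolute rank differences. The function names and the choice of the constant 8 are assumptions for illustration; a result below zero is counted as zero.

```python
from fractions import Fraction


def rank_correlation(key_ranks, student_ranks):
    """1 - 6 * sum(d^2) / (n^3 - n), with d the difference in rank."""
    n = len(key_ranks)
    sum_d2 = sum((k - s) ** 2 for k, s in zip(key_ranks, student_ranks))
    return 1 - Fraction(6 * sum_d2, n ** 3 - n)


def abs_difference_score(key_ranks, student_ranks, constant):
    """Constant minus the sum of absolute rank differences, floored at zero."""
    total = sum(abs(k - s) for k, s in zip(key_ranks, student_ranks))
    return max(constant - total, 0)


key_order = [1, 2, 3, 4]
one_swap_each_half = [2, 1, 4, 3]   # no position agrees with the key
full_reversal = [4, 3, 2, 1]        # no position agrees with the key

for answer in (one_swap_each_half, full_reversal):
    print(rank_correlation(key_order, answer), abs_difference_score(key_order, answer, constant=8))
```

Both answers show zero position-by-position agreement with the key, yet either method separates them: the first answer earns a rank correlation of 3/5 and a score of 4, the complete reversal earns -1 and 0.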


In ranking three items a still simpler method is available. Ask the 
student to mark + for the best of the three, 0 for the poorest, and to 
leave the middle one blank. For all students who have marked the item 
with one +, one 0, and one blank, the papers can be scored by matching 
the key against the student's item and scoring only the alternative 
keyed + and the alternative keyed 0. The student gets one point for 
each of these, which means that he receives two points for perfect 
agreement with the key, one point if either the best or the poorest 
alternative has been confused with the middle one, and zero for all more 
serious confusions. In order to secure more different scores, it is possible 
to assign two points for agreement with the key on + and two points for 
agreement on 0; one point for leaving either the + or the zero alternative 
blank; and no credit for marking with the wrong symbol. Such a scoring 
system gives four points for perfect agreement; three points if the only 
error is a confusion of either the best or worst alternative with the middle 
one; zero for a complete reversal of the correct order; and one point 
for only one inversion from this worst order. It is not possible to get 
two points with this system if the student follows the directions. Such 
a scoring plan makes possible the rapid scoring of rank-order items, and 
has given scores correlating highly with total score in many instances. 

It should be noted that with rank-order items theorems involving K 
(number of items) cannot be applied except for parallel tests that contain 
matched rank-order type items. 
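
The two scoring plans for ranking three items (+ for the best, 0 for the poorest, the middle one left blank) can be checked by enumerating all six ways a student can distribute the three marks. In this sketch the empty string stands for a blank, and the function names are assumptions for illustration.

```python
from itertools import permutations

# Marks keyed for the (best, middle, poorest) alternatives; "" means left blank.
KEY = ("+", "", "0")


def simple_score(marks, key=KEY):
    """One point for each keyed alternative (+ or 0) marked as in the key."""
    return sum(1 for m, k in zip(marks, key) if k != "" and m == k)


def extended_score(marks, key=KEY):
    """Two points for agreement, one for leaving a keyed alternative blank."""
    total = 0
    for m, k in zip(marks, key):
        if k == "":
            continue  # the middle alternative is never scored
        if m == k:
            total += 2
        elif m == "":
            total += 1
    return total


for marks in permutations(KEY):
    print(marks, simple_score(marks), extended_score(marks))
```

The enumeration gives simple scores of 2, 1, 1, 0, 0, 0 and extended scores of 4, 3, 3, 1, 1, 0 over the six possible papers, so a score of two never occurs under the extended plan when the directions are followed.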


10. Summary 


For most tests in the aptitude and achievement field, it has been 
found that the number of items answered correctly, or the number of 
errors, is an eminently satisfactory score. The added labor of using a 
weighted composite of errors and correct responses is worth while only 
in certain special cases. 

The notation used in the special scoring formulas is: 


R (rights), the number of items marked correctly. 

W (wrongs), the number of items marked incorrectly. 

S (skips), the number of items that have not been marked but are 
followed by marked items. 

U (unattempted), the number of consecutive items at the end of the 
test that are not marked. 

B (blank), the number of unmarked items (S + U). 

A (alternatives), the number of possible answers listed for each ques- 
tion. It is assumed here that the same number of alternatives 
are presented for each question. 


If a test that has been designed primarily as a power test turns out 
to have a large number of unattempted (U) items on some papers, and 
a small number on others, it may be that some students are using the 
last two minutes of testing time to mark answers without reading items 
in order to get the benefit of a chance score. In order to score the 
examination fairly, in spite of considerable variation in guessing from 
one person to another, one of the following weighted composites should 
be used. 


(1) $X_B = R + \frac{B}{A}$

or 

(2) $X_W = R - \frac{W}{A - 1}$

or 

(6) $X_U = R + \frac{U}{A}$


Equations 1 and 2 correlate perfectly but do not give identical scores. 

If a test is designed primarily as a speed test, and we find that there 
has been a considerable number of items skipped (S-score) or answered 
incorrectly (W-score), it may be desirable to introduce a small penalty 
for skips and a larger penalty for errors. The formula suggested was 


(7) $\text{Score} = R - \frac{W}{C} - \frac{S}{D}$,


where C and D are arbitrary constants, C < D, in order to penalize 
errors (W) more than skips (S). It is perhaps appropriate to make D 
slightly larger than the number of alternatives A, and C slightly smaller 
than A - 1. 

If time and error scores are to be weighted in determining a composite 
score, and if no criterion is available, judgments must be relied on to 
determine relative weights. Such problems may be handled by the 
“indifference function” technique, or by a rapid and crude graphic 
method as illustrated in Figure 1. If a criterion is available, multiple 
correlation methods can be used as indicated by formulas 8 and 9. 

If we are dealing with items that do not have a clear-cut correct 
answer, such as items in a personality questionnaire, it is necessary to 
have a criterion in order to set up the scoring key. In order to maximize 
the correlation between score and criterion, weights should be assigned 
the different alternatives so that the differences in weights will be pro- 
portional to the differences in mean criterion score for the persons choos- 
ing each alternative. See the procedure given by Guttman in Horst 
(1941, page 341). 

Rank-order items may be quickly scored by an approximation to a 
correlation coefficient given by 


(11) $\text{Score} = C - \sum |d|$,


where $\sum |d|$ is the sum of absolute differences in rank between the 
correct order and the order assigned by the student, and 
C is any arbitrary constant, larger than the greatest possible $\sum |d|$. 

A still simpler system suitable only for the ranking of three items is also 
described in section 9. 


Problems 


1. Derive the formula for the correlation between number correct and number 
incorrect, for an objective test, assuming that there are no omissions. 

2. Derive the formula for the correlation between number correct and number 
incorrect for an objective test. Assume that there are omissions and express the 
results in terms of the variances of number right and number omitted and the 
correlation between these two variables. 

3. Derive the formula for correction for chance success in a test each item of which 
has one correct and six incorrect choices. State clearly each assumption used in the 
derivation. 


4. Comment briefly on the material in Moore’s 1940 article. 


19 


Methods of Standardizing 
and Equating Scores 


1. Introduction 


After having decided on an appropriate scoring system for the test, 
as indicated in the preceding chapter, we must make some decisions 
with reference to the distribution of gross scores obtained. In a schol- 
arship examination some are awarded the scholarships, and the rest 
are not. For Civil Service Examinations, certain persons are placed on 
the eligible list, while others are considered ineligible for certain types 
of jobs. In the examinations given by the College Entrance Examina- 
tion Board the scores are converted to a certain standard form and re- 
ported to college admissions officers, who use these scores along with 
other information in deciding which applicants to accept and which to 
reject. In a college achievement test, given by an instructor in his 
course, it is necessary to decide which students failed the examination, 
which ones made an A grade, which ones made a B grade, etc. In gen- 
eral, we may say that, in using the scores from an examination, it is 
necessary to determine one or more “critical scores” or to report the
results in some standardized form to persons who will make such
decisions, and possibly study the relationship of these scores to other
variables. We shall now consider various factors in, and the different
methods available for, determining critical scores and for standardizing
test scores.


2. Assessing the gross score distribution


For every test, regardless of the standardizing system to be used, it
is desirable to make a frequency distribution of the gross, or raw, scores
and to inspect this distribution carefully. If the test is an achievement
test, it is desirable for a test technician to discuss the various points
with a subject matter expert, since either one alone might overlook
important points.

The first points to note about any test are the number of items ($K$)
and the number of alternative answers ($A$) presented for each item.
From this information we can determine three quantities very
important in evaluating any distribution of scores. These quantities are:


1. The perfect score, which is usually equal to $K$, the number of
items in the test.

2. The average chance score $M_c$, which is usually equal to $K/A$.

3. The variance of a distribution of chance scores, which is $Kp(1 - p)$,
where $K$ is the number of items and $p$ is the probability of
answering an item correctly. If $p$ is taken as $1/A$, the variance of a
distribution of chance scores becomes $K(A - 1)/A^2$, and the standard
deviation of the distribution of chance scores is


(1)    $s_c = \dfrac{\sqrt{K(A - 1)}}{A}.$



It should be noted that these considerations regarding the magnitude 
of a chance score apply only to power tests or to tests that are primarily 
power tests. In speed tests it is necessary to be certain that the number 
of errors made is negligible; methods for determining this have been 
discussed in Chapters 17 and 18. These three quantities ($K$, $K/A$,
and $s_c$) will show the possible meaningful score range for the test. A
score that is within one or two standard deviations ($s_c$) of a chance
score should not be interpreted as signifying any knowledge of the
subject matter of the examination. For example, if we take the standard
that the score must exceed the average chance score by more than $2s_c$,
then, for a 25-item test of 5-choice items, we should have a perfect score
of 25, an average chance score of 5, and, since $s_c$ is 2, a reasonable upper
limit for chance scores may be set at $5 + 2 \times 2 = 9$. That is, this
examination has only 16 possible scores (10 to 25 inclusive) that could
indicate varying degrees of achievement in the field. On the same
basis, we see that a 10-item true-false quiz has only two possible scores
(9 and 10) that could indicate varying degrees of achievement in the
field. As a first check on any examination, it is well to be certain that
the lowest score that is taken as indicative of knowledge is well above 
the average chance score, and to be certain that the number of possible 
scores between this lowest score and the highest score is considerably 
greater than the number of subgroups we wish to determine from the 
test. 
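
The first check just described can be sketched in a few lines. The function and field names below are illustrative assumptions; the arithmetic is that of equation 1 with the "more than two standard deviations above chance" standard used in the text.

```python
import math


def chance_score_summary(n_items, n_choices, n_sd=2):
    """Perfect score, chance mean, chance SD, and lowest non-chance score."""
    chance_mean = n_items / n_choices
    chance_sd = math.sqrt(n_items * (n_choices - 1)) / n_choices  # equation 1
    upper_chance_limit = chance_mean + n_sd * chance_sd
    # A meaningful score must exceed the upper chance limit.
    lowest_meaningful = math.floor(upper_chance_limit) + 1
    return {
        "perfect": n_items,
        "chance_mean": chance_mean,
        "chance_sd": chance_sd,
        "lowest_meaningful": lowest_meaningful,
        "n_meaningful_scores": max(0, n_items - lowest_meaningful + 1),
    }


print(chance_score_summary(25, 5))   # the 25-item, 5-choice test of the text
print(chance_score_summary(10, 2))   # the 10-item true-false quiz
```

The count of meaningful scores can then be compared with the number of subgroups the test is expected to distinguish.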

Having obtained the lowest non-chance score and the perfect score 
from knowing only the number of items, and the number of choices per 
item, we next make a frequency distribution of scores and find the
mean and standard deviation of this distribution. It is also necessary 
to use some method of determining the reliability of the test and the 
error of measurement in order to compare the error of measurement 
with the score range between the upper and lower bound of any given 
subgroup. For example, if an achievement examination is being used 
to divide students into A’s, B’s, C’s, and D’s, it would seem desirable to 
have the score range of about three or four times the error of measure- 
ment from the lowest B to the highest B. We can readily see that, if 
this score range is equal only to the error of measurement, then through 
examination error alone quite a few students who should receive A’s 
will receive C’s, and vice versa. Errors in classification of students at 
the borderline between A and B, between B and C, etc., cannot be
avoided under any circumstances. It is possible, however, by making
the distance between the upper and lower bound of any one subclass
large, in comparison with the error of measurement, to be relatively
certain that errors of classification of two or more groups will be avoided.

In general the significant or important distances on the scale, such 
as the distances between different critical scores or the differences be- 
tween successive school grades or successive years, should be very large 
in comparison with the error of measurement of the test. A difference 
as large as this is necessary in order to insure that important decisions 
are not made on the basis of accidental fluctuations. 
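
A rough sketch of this comparison follows. It assumes the usual expression for the error of measurement, the standard deviation times the square root of one minus the reliability (the form used in section 5 of this chapter); the grade limits and the reliability of .91 are hypothetical values chosen only for illustration.

```python
import math


def error_of_measurement(sd, reliability):
    """Standard error of measurement: sd * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


def band_width_in_errors(lower_bound, upper_bound, sd, reliability):
    """Width of a score band expressed in units of the error of measurement."""
    return (upper_bound - lower_bound) / error_of_measurement(sd, reliability)


# Hypothetical example: a B grade running from 180 to 199 on a test with
# a standard deviation of 17.234 and an assumed reliability of .91.
sem = error_of_measurement(17.234, 0.91)
ratio = band_width_in_errors(180, 199, 17.234, 0.91)
print(f"error of measurement = {sem:.2f}, band width = {ratio:.1f} errors")
```

A ratio of about three or four, as suggested above, makes misclassification by two grade groups unlikely; a ratio near one does not.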

It should be noted that nothing has been said about “per cent of a
perfect score” as one of the criteria for judging the distribution of raw
scores. Unless we have very thorough procedures for pretesting items
so that item difficulty and test reliability are equated from one examina- 
tion to another, the amount of knowledge indicated by a given per cent 
of perfect score will vary tremendously from one examination to another. 
If an examination is composed of items that are answered correctly on 
the average by 80 per cent of the students, the average student will make 
80 per cent of a perfect score, and the upper half of the class will be 
grouped in the narrow score range between 80 per cent of perfect and 
perfect. Compare such an examination with one which attempts to 
discriminate between the good student and the very superior student.
This latter examination would probably be composed of items answered
correctly on the average by 50 per cent of the students so that the
average student would make 50 per cent of a perfect score, and the wide 
score range from 50 per cent correct to perfect would be available for 
distinguishing between various degrees of ability in the upper half of 
the class. Scoring these two examinations on the basis of per cent of a 
perfect score would not give satisfactory results. Each distribution 
must be inspected to determine where the average score and scores one,
two, and three standard deviations above and below average lie with 
respect to the perfect score and the lowest non-chance score. The 
judges should then select the various critical score points, taking care 
to make the distance between these points reasonably greater than the 
error of measurement of the test. 

The effect on standards of requiring “successive hurdles” versus per-
mitting multiple attempts to pass the examination must be considered
in setting any critical score. The term successive hurdles is used to 
designate a procedure whereby the successful candidate must have 
passed each of several tests; failure on any one disqualifies the person. 
If such an administrative procedure is being followed, it is essential to 
pass many more at any given step than are desired to pass the total 
procedure. On the other hand, the effect of permitting multiple at- 
tempts is to lower standards, particularly if the examination is unreli- 
able. In effect this is the opposite of the successive hurdles procedure 
in which a single failure disqualifies the candidate. If many trials are 
allowed, the candidate is usually passed if he succeeds in any one attempt. 
In order not to be accepted the candidate must fail in every attempt. 
It is clear then that, if multiple trials are permitted, it is well to err on 
the side of fixing the lowest passing score too high; whereas, if a
successive hurdles procedure is followed, it is well to err on the side of
fixing the lowest passing score too low.
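
The opposite effects of the two procedures can be seen in a small numerical sketch. It rests on a simplifying assumption not made in the text, namely that the outcomes of the several tests or attempts are independent; the pass rates are hypothetical.

```python
def pass_all_hurdles(pass_rates):
    """Proportion passing every one of several hurdles (independence assumed)."""
    proportion = 1.0
    for rate in pass_rates:
        proportion *= rate
    return proportion


def pass_any_attempt(pass_rate, n_attempts):
    """Proportion passing at least once in n attempts (independence assumed)."""
    return 1.0 - (1.0 - pass_rate) ** n_attempts


# Three hurdles, each passed by 80 per cent: about half survive all three.
print(round(pass_all_hurdles([0.8, 0.8, 0.8]), 3))
# Three attempts at a test passed by 50 per cent on any one attempt.
print(round(pass_any_attempt(0.5, 3), 3))
```

Since real tests are positively correlated, the actual proportions will be less extreme than these figures, but the direction of the effect is the same.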


3. Standardizing by expert judgment, using an arbitrary scale 

In some cases the major interest of the subject matter expert lies in 
one or two critical scores; yet it is desirable, or required by some regula- 
tion, that many different score values be reported. For example, in 
some colleges it is conventional to grade on a scale from 100 (repre- 
senting a perfect score) to 65 or 70 (representing the failure line). In 
Navy schools it is conventional to grade on a scale from 4.0 (represent- 
ing a perfect score) to 1.0, or possibly lower, for the poorest possible 
performance. On this scale 2.5 is a critical score (the lowest passing 
grade). In Civil Service ratings 70 is defined by regulations as the mark 
to be assigned the lowest acceptable performance, and 100 is the highest 
mark to be assigned.

In making any transformation from a given raw score scale to some 
conventional scale with critical limiting values, the simplest and best
procedure is to determine the limiting values carefully and then to make 
a linear interpolation between these values. Such a procedure is de- 
scribed for use in Navy schools by Stuit (1947), pages 485-487. A simi- 
lar procedure for converting raw scores into Civil Service ratings is 
described by Adkins et al. (1947), pages 194-202. The simplest method
is a graphic one. 

1. Prepare a graph in which the various possible raw score values are 
indicated on one axis, and the various possible values of the desired 
arbitrary scale are indicated on the other axis. 

2. Prepare a frequency distribution of the raw score values with the 
various key points such as the mean, standard deviation, standard error 
of measurement, perfect score, average chance score, and average 
chance score plus one or two times the standard deviation of such scores 
(see equation 1) indicated. 

3. Determine the raw score corresponding to some critical level, such 
as the lowest passing mark. In determining this point all relevant fac-
tors must be considered, such as the probable difficulty level of the
examination, the standards it is necessary to maintain, and the number 
and per cent of candidates above or below this critical point. Paren- 
thetically, it may be remarked that sometimes a committee will feel 
that it is desirable to look for gaps in the score distribution, and to set 
the lowest passing mark just above such a gap. As pointed out by 
Adkins et al. (1947), pages 197-198, such gaps are purely accidental and 
should be ignored in favor of more rational considerations in determin- 
ing the critical points. 

4. Determine the raw score corresponding to another fixed point, 
such as the highest score to be assigned. The top score of 100, 99, or 
4.0 need not be assigned to a raw score that corresponds to a perfect 
paper. If the examination is very difficult, it might be desirable to 
take a raw score considerably lower than the perfect one to correspond 
to the highest assigned score. At the other extreme, if the examination 
is very easy so that, for instance, 5 or 10 per cent of the persons made a 
perfect raw score, it might be desirable to assign a score below the 
highest allowable (such as 80 or 3.5) to the perfect raw score. 

5. Plot the points determined in steps 3 and 4 on the graph, and con- 
nect them with a straight line. From this line it is possible to read off 
the transformed score corresponding to each raw score. 

By repeating steps analogous to 3 and 4 for other critical points it is 
possible to set up several different linear transformations in different 
parts of the scale, should that appear desirable.
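
The graphic procedure of steps 1 to 5 amounts to linear interpolation between the fixed points, which may be sketched as below. The function name and the particular raw scores paired with the ratings 70, 85, and 100 are hypothetical; in practice the fixed points come from the judgments described in steps 3 and 4.

```python
def make_linear_conversion(fixed_points):
    """Return a function converting raw scores by linear interpolation.

    fixed_points is a list of (raw score, assigned score) pairs with
    distinct raw scores. Two points give a single straight line; more
    points give a separate line in each part of the scale. Raw scores
    beyond the end points are read from the nearest line, extended.
    """
    points = sorted(fixed_points)
    if len(points) < 2:
        raise ValueError("at least two fixed points are required")

    def convert(raw):
        # Pick the first segment whose upper end is at or above raw;
        # if none is, the loop finishes on the last segment.
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            if raw <= x1:
                break
        return y0 + (y1 - y0) * (raw - x0) / (x1 - x0)

    return convert


# Hypothetical fixed points: raw 30 is the lowest passing mark (70),
# raw 60 receives the highest rating (100).
to_rating = make_linear_conversion([(30, 70), (60, 100)])
print(to_rating(45))

# A third fixed point gives two different lines in different parts of the scale.
to_rating_2 = make_linear_conversion([(30, 70), (50, 85), (60, 100)])
print(to_rating_2(55))
```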


4. Transformations to indicate the individual's standing in 
his group—general considerations 

In many testing situations it is not possible or desirable or necessary 
to make immediate decisions for action on the basis of the gross score
distribution. In such situations it is conventional to transform the
gross scores into some uniform set of numbers that indicates the relative
standing of the individual in his group. For example, transformed 
scores on the tests of the College Entrance Examination Board are re- 
ported to the designated colleges. The admissions officer of each college 
determines which scores will be regarded as critical for purposes of ad- 
mission to his institution. In the aptitude testing programs of the 
Army, Navy, and Air Forces, during the second World War, the tests 
were given and the transformed scores made a part of the man’s per- 
manent record. As experience accumulated regarding the performance 
of men in different schools and jobs or as the relative needs in the dif- 
ferent schools changed, the critical score requirements could be specified 
and altered. 

Four different gross score transformations that indicate the relative 
standing of the individual in his group will be considered: (1) linear 
transformations, including (a) standard score and (b) linear derived 
scores; (2) non-linear transformations, including (a) percentile score
and (b) normalized score. 

In using standard, linear derived, percentile, or normalized scores, we
should bear in mind that such scores indicate only the relationship of
the individual to a given group. They indicate nothing about the gen- 
eral level of knowledge or attainment of the group or its members. For 
example, a set of percentile or standard scores on a test in American 
history would not indicate whether the students had a comprehensive 
grasp of the major items in American history or only a very meager 
knowledge. Such an assessment must be based on the judgment of
subject matter experts, and can never be determined by clever
quantitative scoring devices. In setting up the test, the judgment of the
subject matter expert is used to include a good sampling of items from the
field, and as indicated in section 2 of this chapter the subject matter
expert must assess the gross score distribution, with the help of a test
technician, in order to determine critical scores between the chance
score and the perfect score. From this point of view the most
satisfactory testing programs are those closely related to training programs so
that the subject matter expert may, for instance, judge: “The
performance of these students is unsatisfactory; I will step up the quality and
quantity of work demanded in the training program so that the next class
will make a higher average gross score than this one has made.” By
using the same or parallel tests, it is then possible to see whether or not 
the altered training program has produced the desired result of a higher 
test score. For an illustration of such a use of testing in conjunction 
with training programs, see Stuit (1947), pages 303-313. The blind use 
of group norms, such as the standard, percentile, or normalized scores,
without any assessment of the absolute level of achievement in terms of
judgments of subject matter experts, may serve to conceal marked in- 
adequacy of training standards. 


5. Linear transformations—standard score 

The basic linear transformation of gross scores is known as the 
“standard score.” The individual’s score is expressed as a deviation 
from the mean of the distribution (that is, the mean is taken as the 
origin or zero point). The score unit is taken as the standard deviation
of the gross score distribution. Standard scores will thus have a mean
of zero and a standard deviation of unity. Using $z_i$ to designate the
standard score of the $i$th individual, we may write


(2)    $z_i = \dfrac{X_i - \mu}{\sigma},$


where $X_i$ is the gross score of the $i$th individual,
$\mu$ is the population mean, and
$\sigma$ is the standard deviation of the population.


Since the mean and standard deviation of the population are usually 
unknown, it is conventional to have a large sample and to use the mean 
and standard deviation of this sample in computing standard scores. 
The formula may be written 


(3)    $z_i = \dfrac{X_i - M_X}{s_X},$


where $M_X$ and $s_X$ are the mean and standard deviation of the
distribution of gross scores ($X$). The numerator of equation 3 is frequently
designated by the lower case x ($x_i = X_i - M_X$), and is referred to as a
deviation score. The term “deviation score” [...] expressed in terms of
deviation [...] the population mean and standard deviation are usually
unknown, it is [...] of equation 2 in calculating standard scores. However,
the general problem of standardizing several different forms of a test or of
using [...] are involved becomes [...] which the standard scores are to be
computed, and then to use maximum likelihood estimates of the gross score
mean and standard deviation of this population.

From equations 2 or 3, we see that a standard score is a score in
which the mean of the distribution is zero and its standard deviation is
unity. All standard scores above the mean will be positive, and all
below the mean will be negative. A person whose raw score is at the
mean of the distribution will have a standard score of zero, since for
that person

$\dfrac{M_X - M_X}{s_X} = 0.$

A person whose score is one standard deviation above the mean will
have a standard score of 1.00, since $X_i - M_X = s_X$; hence equation 3
equals 1.00.

In order to use equation 3 for computing the standard score equivalent
to each gross score, it is convenient to rewrite it in the form


(4)    $z_i = \left(\dfrac{1}{s_X}\right) X_i - \dfrac{M_X}{s_X}.$


In computing z-scores when $X_i > M_X$, enter $-M_X/s_X$ in the computing
machine, clear the keyboard, put the quantity $(1/s_X)$ in the keyboard,
and add it in $X_i$ times, where $X_i$ is the gross score value just above the
mean score. Record the z-value corresponding to this X-score, and then
add in $(1/s_X)$ once more for the next higher score, and so on until the
z-value for the highest attained or the highest possible X-score has
been found. When $X_i < M_X$, the procedure is similar, except that all
the signs of the quantities must be reversed. Enter $+(M_X/s_X)$ in the
machine, put in the quantity $(1/s_X)$, and subtract it once to give the
z-score corresponding to a gross score of 1, twice to give the value
corresponding to a gross score of 2, and so on until the value $X_j$ is reached,
where $X_j$ is the gross score value just below the mean. All the z-scores
corresponding to X-scores below the mean must be given negative signs.


TABLE 1

FREQUENCY DISTRIBUTION

Score       | Frequency f |  d  |  fd  | fd²
120-129     |      2      | −5  | −10  |  50
130-139     |      3      | −4  | −12  |  48
140-149     |     12      | −3  | −36  | 108
150-159     |     23      | −2  | −46  |  92
160-169     |     37      | −1  | −37  |  37
170-179     |     51      |  0  |   0  |   0
180-189     |     39      | +1  | +39  |  39
190-199     |     21      | +2  | +42  |  84
200-209     |      9      | +3  | +27  |  81
210-219     |      2      | +4  |  +8  |  32
220-229     |      1      | +5  |  +5  |  25
Column sums |    200      |     | −20  | 596
Sums/N      |             |     | −0.1 | 2.98

d is the deviation of the class interval from the assumed mean, in class
interval units. Class interval (CI) = 10.
Correction term = (−0.1)(CI) = −1.0.
Assumed mean 174.5 plus correction term −1.0 equals gross score mean 173.5.
Variance in class interval units = 2.98 − 0.01 = 2.97.
Gross score variance = 2.97 (CI)² = 297.0.
Gross score standard deviation = √297 = 17.234.

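
The mean and standard deviation quoted for Table 1 come from the assumed-mean method for grouped data, which may be sketched as follows. The variable and function names are illustrative assumptions; the frequency of 21 for the interval 190-199 is the value implied by the column sums of the table (N = 200, sum of fd = −20, sum of fd² = 596).

```python
import math

# Class-interval lower limits and frequencies of Table 1.
# The 190-199 frequency (21) is the value implied by the column sums.
LOWER_LIMITS = list(range(120, 230, 10))
FREQUENCIES = [2, 3, 12, 23, 37, 51, 39, 21, 9, 2, 1]


def grouped_mean_and_sd(lower_limits, frequencies, class_interval, assumed_mean):
    """Mean, variance, and SD of grouped integer scores by the assumed-mean method."""
    n = sum(frequencies)
    sum_fd = 0.0
    sum_fd2 = 0.0
    for lower, f in zip(lower_limits, frequencies):
        midpoint = lower + (class_interval - 1) / 2        # e.g. 170-179 -> 174.5
        d = (midpoint - assumed_mean) / class_interval     # deviation in CI units
        sum_fd += f * d
        sum_fd2 += f * d * d
    mean = assumed_mean + class_interval * (sum_fd / n)    # add correction term
    variance = class_interval ** 2 * (sum_fd2 / n - (sum_fd / n) ** 2)
    return mean, variance, math.sqrt(variance)


mean, variance, sd = grouped_mean_and_sd(LOWER_LIMITS, FREQUENCIES, 10, 174.5)
print(f"mean = {mean:.1f}, variance = {variance:.1f}, sd = {sd:.3f}")
```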

The computing procedure for z-scores is illustrated with the frequency
distribution of 200 cases shown in Table 1. This frequency distribution
has a mean of 173.5, a variance of 297.0, and a standard deviation of
17.234. Substituting these values in equation 3 gives

$z_i = \dfrac{X_i - 173.5}{17.234}.$

The computation equation 4 thus becomes

$z_i = \dfrac{1}{17.234} X_i - \dfrac{173.5}{17.234},$

or

$z_i = 0.058025 X_i - 10.067309.$


This equation is used directly in the computing machine for computing
the standard score equivalent of all gross scores above the mean. Table 2
illustrates a worksheet¹ used to compute a standard score equivalent for
the midpoint of each class interval. The entries below the horizontal line
in Table 2, where $X_i$ takes in succession the values from 174.5 to 224.5,
correspond to scores higher than the mean of the distribution. Since
the class interval in Table 1 is 10, the coefficient in the computing
equation is multiplied by 10X at each step, instead of by X. For the
negative entries above the horizontal line in Table 2 the equation used in the
computing machine is

$-z_i = 10.067309 - 0.058025 X_i,$


where $X_i$ takes in succession the values from 124.5 to 164.5 shown in
the column labeled X in Table 2. Column z gives the standard scores
to six decimal places. However, standard scores of psychological tests
should at most be given to two decimal places, as shown in column z′.
It may be noted that, even with a test reliability as high as .99, the
error of measurement is

$s_X\sqrt{1 - .99} = s_X\sqrt{.01} = .1 s_X,$

so that the error of measurement is greater than one-tenth the standard
deviation for practically all tests.

¹ For linear derived scores to be discussed in the next section, a convenient
worksheet will be shown that gives a derived score equivalent for each different
gross score. This procedure, illustrated in Table 3, may be adapted to z-scores
if a standard score equivalent is desired for each gross score.


TABLE 2

COMPUTING FORM FOR STANDARD SCORES

  X    |     z     |  z′
124.5  | −2.843196 | −2.84
134.5  | −2.262946 | −2.26
144.5  | −1.682696 | −1.68    −z_i = 10.067309 − 0.058025 X_i
154.5  | −1.102446 | −1.10
164.5  | −0.522196 | −0.52
--------------------------------
174.5  | +0.058054 | +0.06
184.5  | +0.638304 | +0.64
194.5  | +1.218554 | +1.22    z_i = 0.058025 X_i − 10.067309
204.5  | +1.798804 | +1.80
214.5  | +2.379054 | +2.38
224.5  | +2.959304 | +2.96


In order to check the entries in Table 2, the differences between adja- 
cent entries should be computed. In this table these differences are 
each equal to 0.58025, which is ten times the multiplying coefficient in the
computing equation. It is also desirable to recompute the entries for 
about three selected points, one near the middle and one near each end 
of the scale. Gross errors may also be detected by computing the gross 
score equivalents for —3, —2, —1, 0, +1, +2, and +3 standard 
deviations from the mean to see that these scores fall in the proper 
intervals. 
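
The worksheet of Table 2 and the checks just described can be reproduced with a short sketch. The function names are assumptions; the mean of 173.5 and standard deviation of 17.234 are those of Table 1.

```python
MEAN_X = 173.5
SD_X = 17.234


def standard_score(x, mean=MEAN_X, sd=SD_X):
    """Equation 3: z = (X - M_X) / s_X."""
    return (x - mean) / sd


def gross_score(z, mean=MEAN_X, sd=SD_X):
    """Inverse of equation 3, used for the check on gross errors."""
    return mean + z * sd


# Midpoints of the class intervals of Table 1: 124.5, 134.5, ..., 224.5.
midpoints = [124.5 + 10 * k for k in range(11)]
z_scores = [standard_score(x) for x in midpoints]

# First check: adjacent entries should differ by ten times the coefficient.
differences = [b - a for a, b in zip(z_scores, z_scores[1:])]

for x, z in zip(midpoints, z_scores):
    print(f"{x:6.1f}  {z:+.6f}  {z:+.2f}")

# Second check: gross scores at -3, ..., +3 standard deviations from the mean.
for k in range(-3, 4):
    print(f"{k:+d} sd  ->  X = {gross_score(k):.1f}")
```

Because the sketch divides by 17.234 directly instead of using the rounded coefficient 0.058025, its six-place values may differ from those of Table 2 in the last places; the two-place values agree.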

If a graphic method of setting up the transformation from X-scores 
to z-scores is preferred, the simplest method is to set up appropriate
coordinates on a graph, including the range of X-scores on one axis and
z-scores from about −3.0 to +3.0 on the other axis. Select one X-score
approximately 2 or 3 standard deviations below the mean, and
calculate the corresponding z-score. Do the same for a high X-score
approximately 2 or 3 standard deviations above the mean. Plot
these two points on the graph, and connect them with a straight line.
Several check points may then be selected. For example, a z-score of
zero should correspond exactly to $M_X$ on the X-scale, and scores that
are one standard deviation above and below the mean on the X-scale 
should correspond to z-values of plus one and minus one, respectively. 
The standard score is primarily useful for theoretical purposes. For 
example, it simplifies algebraic derivations involving variances and co- 
variances; and tables of the normal curve have z-scores as one of their 
entries. However, it has marked disadvantages as a method of report- 
ing scores for the individuals of a group. The range from —3 to +3 is 
awkward since it necessitates the use of negative and positive numbers. 
Also, in order to have a sufficient number of different scores, it is neces- 
sary to use decimals. It is conventional, therefore, to use some more 
convenient linear transformation of standard scores for reporting pur- 
poses. These, termed linear derived scores, will be considered in the 
next section.


6. Linear transformations—linear derived scores 


Since the standard score (z-score) with a mean of zero and a standard 
deviation of unity necessitates using negative and decimal scores, it is 
usual to report scores in terms of some arbitrary distribution that has a 
standard deviation considerably greater than unity and a mean that is 
four or five times the standard deviation. Such a set of units, called 
here linear derived scores, avoids both negative and fractional scores. 

Several different transformations of this type have been found useful 
in different circumstances. For example, the Board of Examinations at 
the University of Chicago has used a linear derived score with a mean
of 20 and a standard deviation of 4. Most scores would thus lie between 
8 and 32; and, even if an occasional score of plus or minus five standard 
deviations were found, we should still have scores ranging only from 
0 to 40. Such scores would not be confused easily with percentile scores 
that were used in reporting some of the entrance tests, and a class
interval of one-fourth standard deviation is convenient for computing
variances and correlations so that decimal scores need not be used. The
College Entrance Examination Board adopted a linear derived score 
system for reporting scores on its examinations to the colleges. These 
scores have a mean of 500 and a standard deviation of 100. They 
range from a lower limit of 200 to an upper limit of 800, and cannot 
possibly be confused with percentile ratings, grade ratings (with 100 as 
perfect and 60 or 65 as failure), mental age ratings (in the 10 to 20
range), or I.Q. ratings (in the 100 to 150 range) that may appear on the
applicant’s secondary school record. Because such scores would be
unwieldy to record or to use in IBM card operations, the College Entrance
Examination Board also adopted another linear derived score system 
for use within the office for keeping certain records, computing correla- 
tions, making item analyses, etc. This system uses a mean of 13 and a 
standard deviation of 4. The particular advantage of this scale is that 
the scores can be recorded in two columns of an IBM card, and the
squares of the scores can be recorded in three columns. A score as
large as $+4.5\sigma$ would be 31, and the square of 31 is 961. Using five
columns of the card to record the score and the square of the score facilitates
many operations that require computing sums and sums of squares. 

During the second World War, the United States Navy used a basic 
aptitude test battery and reported scores in terms of a linear derived 
scale with a mean of 50 and a standard deviation of 10. Such a scale 
could be reported in two columns of an IBM card. Moreover, as long 
as operations requiring sums of squares were not used to a great extent, 
maximum use was made of the IBM cards, and the scale had reasonably
fine subdivisions. The United States Army used an aptitude test
battery and reported the scores on a linear derived scale with a mean of 100
and a standard deviation of 20. This made the scale somewhat
comparable to the I.Q. scale so that not too much change in habits regarding
meaning of the numbers was required to make reasonable judgments
for the new test scores. These examples illustrate some of the types of
linear derived scores in use, and indicate some of the reasons for
selecting given arbitrary values for the mean and standard deviation of the
derived score scale.

In order to determine the formula for computing any of these linear
derived scores, let us use $w_i$ to designate the linear derived score of the
$i$th individual and write

(5)    $w_i = s_w z_i + M_w,$

where $M_w$ is the value that has been selected as convenient for the mean
of the linear derived scores, and
$s_w$ is the value that has been selected as convenient for the
standard deviation of these scores.

Since the standard deviation of the z-scores is unity, multiplying each
z-score by $s_w$ will give a set of scores with a standard deviation equal to
$s_w$. Also, since the mean z-score is zero, adding $M_w$ to each score will
give a set of scores with a mean equal to $M_w$. Thus the transformation
of equation 5 insures that the new scores will have the desired mean and
standard deviation.

To express the w-scores directly in terms of the gross scores, substitute
equation 4 in equation 5 and write

(6)    $w_i = \left(\dfrac{s_w}{s_X}\right) X_i + M_w - \left(\dfrac{s_w}{s_X}\right) M_X,$


where the terms have the same definitions as in equations 3 and 5.
The computing procedure is similar to that for equation 4, except that
no provision need be made for negative scores, since $M_w$ and $s_w$ are
selected so that all scores are positive. The procedure is to enter
$M_w - (s_w/s_X) M_X$ in the keyboard, put it into the machine, clear the
keyboard, and enter $(s_w/s_X)$. Add this quantity once to obtain the
w-score equivalent to an X-score of one, twice for the equivalent of an
X-score of two, and so on until the highest X-score has been reached.
Again a graphic check can be made by computing the w-score equal to 
a very low X-score, and to a very high X-score, plotting these two 
points on a graph, connecting them with a straight line, and then com- 
puting several intermediate check points. 
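
Equation 6 may be sketched as a single function. The function name and the dictionary of scales are assumptions made for illustration; the means and standard deviations listed are those quoted above for the several agencies, and the gross score mean and standard deviation are those of Table 1.

```python
def linear_derived_score(x, mean_x, sd_x, mean_w, sd_w):
    """Equation 6: w = (s_w / s_X) X + M_w - (s_w / s_X) M_X."""
    slope = sd_w / sd_x
    additive_term = mean_w - slope * mean_x
    return slope * x + additive_term


# (mean, standard deviation) of the derived scales mentioned in the text.
SCALES = {
    "University of Chicago": (20, 4),
    "College Entrance Examination Board": (500, 100),
    "CEEB office scale": (13, 4),
    "Navy": (50, 10),
    "Army": (100, 20),
}

# A gross score one standard deviation above the mean of Table 1.
x = 173.5 + 17.234
for name, (mean_w, sd_w) in SCALES.items():
    w = linear_derived_score(x, 173.5, 17.234, mean_w, sd_w)
    print(f"{name}: {w:.0f}")
```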

Linear derived scores (including, of course, standard scores) have this 
very valuable property: the characteristics of the original distribution 
of gross scores are duplicated in the transformed scores. The indices of 
skewness and kurtosis for the distribution of gross scores are identical 
with the indices for the distribution of linear derived scores, and both 
sets of scores will have the same correlation with any other variable. 
Non-linear transformations of gross scores will in general have indices 
of skewness, kurtosis, and correlation that are different from those of 
the original gross scores. 
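
This invariance is easily verified numerically. The sketch below uses moment indices of skewness and kurtosis and the product-moment correlation; the two short lists of scores are hypothetical data invented for the check, and the transformation applied is the computing equation derived below for a scale with mean 500 and standard deviation 100.

```python
import math


def mean(values):
    return sum(values) / len(values)


def central_moment(values, order):
    m = mean(values)
    return sum((v - m) ** order for v in values) / len(values)


def skewness(values):
    """Third moment about the mean divided by the cube of the SD."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    """Fourth moment about the mean divided by the squared variance."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


def correlation(a, b):
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)
    return cov / math.sqrt(central_moment(a, 2) * central_moment(b, 2))


# Hypothetical gross scores and scores on some other variable.
gross = [120, 135, 150, 160, 170, 172, 175, 180, 190, 225]
other = [3, 4, 6, 5, 7, 8, 7, 9, 9, 12]

# A linear transformation with a positive coefficient.
derived = [5.8025 * x - 506.7309 for x in gross]

print(skewness(gross), skewness(derived))
print(kurtosis(gross), kurtosis(derived))
print(correlation(gross, other), correlation(derived, other))
```

Each pair of printed values agrees except for rounding in the last decimal places.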

The data of Table 1 are used to illustrate the computation of linear
derived scores with a mean of 500 and a standard deviation of 100.
Substituting these values for $M_w$ and $s_w$, and the mean and standard
deviation of Table 1 for $M_X$ and $s_X$ in equation 6, gives the equation

$w_i = \dfrac{100}{17.234} X_i + 500 - \dfrac{100}{17.234}(173.5),$

which may be written as the computing equation

$w_i = 5.8025 X_i - 506.7309.$


The rectangular layout of Table 3 furnishes a convenient method of 
recording a linear derived score equivalent for each gross score of 
120 to 229. The computing procedure is to enter the additive term 
(-506.7309) in the keyboard and into the machine, then to clear the 
keyboard and to enter the coefficient (+5.8025). This coefficient is 
then multiplied by 120 to give the first entry (190). One additional 
rotation of the machine is needed to give each of the remaining 109 
entries in Table 3. The results are recorded to only three digits, which 
corresponds to units of one-hundredth of a standard deviation. The 
best method for checking a table like Table 3 is first to compute 
successive differences. These differences should each be equal to the 
constant coefficient, which in the present illustration is about 5.8, so 
that to the nearest unit the difference is usually 6, with an occasional 
5. Second, to check for gross errors, it is desirable to determine the 
gross score points corresponding to -3, -2, -1, 0, +1, +2, and +3 
standard deviations from the mean and to see that these are, 
respectively, 200, 300, 400, 500, 600, 700, and 800. 


TABLE 3 


COMPUTING FORM FOR LINEAR DERIVED SCORES 


$$w_i = \frac{100}{17.234} X_i + 500 - \frac{100}{17.234}(173.5)$$

or 

$$w_i = 5.8025X_i - 506.7309$$


X   |   0 |   1 |   2 |   3 |   4 |   5 |   6 |   7 |   8 |   9 
12- | 190 | 195 | 201 | 207 | 213 | 219 | 224 | 230 | 236 | 242 
13- | 248 | 253 | 259 | 265 | 271 | 277 | 282 | 288 | 294 | 300 
14- | 306 | 311 | 317 | 323 | 329 | 335 | 340 | 346 | 352 | 358 
15- | 364 | 369 | 375 | 381 | 387 | 393 | 398 | 404 | 410 | 416 
16- | 422 | 427 | 433 | 439 | 445 | 451 | 456 | 462 | 468 | 474 
17- | 480 | 485 | 491 | 497 | 503 | 509 | 515 | 520 | 526 | 532 
18- | 538 | 544 | 549 | 555 | 561 | 567 | 573 | 578 | 584 | 590 
19- | 596 | 602 | 607 | 613 | 619 | 625 | 631 | 636 | 642 | 648 
20- | 654 | 660 | 665 | 671 | 677 | 683 | 689 | 694 | 700 | 706 
21- | 712 | 718 | 723 | 729 | 735 | 741 | 747 | 752 | 758 | 764 
22- | 770 | 776 | 781 | 787 | 793 | 799 | 805 | 810 | 816 | 822 
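
The machine routine for Table 3 can be sketched in a few lines of 
Python. This is only an illustration under stated assumptions: the 
function name and its arguments are ours, and the mean (173.5) and 
standard deviation (17.234) are the Table 1 values implied by the 
computing equation. The sketch also carries out the first check, that 
successive differences are 5 or 6. 

```python
def linear_derived_score(x, mean_x=173.5, sd_x=17.234,
                         mean_w=500.0, sd_w=100.0):
    """Linear derived score for gross score x (hypothetical helper)."""
    slope = sd_w / sd_x                    # about 5.8025
    intercept = mean_w - slope * mean_x    # about -506.7309
    return slope * x + intercept


# One entry for each gross score from 120 to 229, as in Table 3.
table3 = {x: round(linear_derived_score(x)) for x in range(120, 230)}

# First check: successive differences should be 6, occasionally 5.
differences = [table3[x + 1] - table3[x] for x in range(120, 229)]

# Print the table in the rectangular layout, ten entries per row.
for tens in range(12, 23):
    row = [table3[10 * tens + unit] for unit in range(10)]
    print(f"{tens}- | " + " | ".join(str(entry) for entry in row))
```

Rounding to the nearest unit mirrors the three-digit recording described 
above; a different rounding rule would change an occasional entry by one 
unit. 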

Linear derived scores, like standard scores, may also be obtained 
graphically. The best procedure is to set up the gross score and the 
derived score scale in suitable units on graph paper. The gross score 
and corresponding linear derived score are then found for three points, 
such as —3, 0, and +3 standard deviations from the mean. These 
three points are plotted; they should lie in a straight line. This line is 
the transformation line from which the derived score equivalent for a 
given gross score, or the reverse, may be read. 
Let us contrast the properties of linear transformations of gross 
scores (such as standard and other derived scores) with the properties 
of non-linear transformations (percentile and normalized scores) to be 
considered next. 


1. The linear transformation involves no assumptions about the 
distribution of the population or the sample. It has third and 
fourth moments identical with those of the raw score distribution. 
This fact has several important consequences. 

2. It is possible to tell from the distribution of transformed scores 
whether the test was too easy, too difficult, or about the correct 
difficulty level for the group. 

3. Since the correlation between gross scores is identical with the 
correlation between linear transformations of gross scores, the 
equations dealing with the effect of test length and group hetero- 
geneity on reliability and validity (see Chapters 6 to 13) hold for 
gross scores and for any linear transformation of gross scores. The 
equations developed in Chapters 6 to 13 do not necessarily hold 
for non-linear transformations of gross scores. 

4. Equating various forms of tests is simpler if some linear transforma- 
tion is used, since such a transformation depends only on estimat- 
ing two parameters, the mean and the variance. The theory for 
equating when transformations are non-linear is more difficult to 
develop, and probably will give results with greater sampling errors. 
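
Points 1 and 3 can be checked numerically. The sketch below is an 
illustration only: the two lists of gross scores are hypothetical, and 
the helper names are ours. It computes the standardized third and fourth 
moments and the correlation before and after a linear transformation 
with a positive coefficient. 

```python
from statistics import fmean, pstdev


def standardized_moment(values, order):
    """Mean of ((v - mean) / sd) ** order; order 3 is skewness, 4 kurtosis."""
    mean, sd = fmean(values), pstdev(values)
    return fmean(((v - mean) / sd) ** order for v in values)


def correlation(xs, ys):
    """Product-moment correlation of two equally long lists."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    covariance = fmean((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return covariance / (pstdev(xs) * pstdev(ys))


# Hypothetical gross scores on two tests for ten persons.
gross_x = [121, 140, 155, 168, 171, 176, 183, 190, 204, 227]
gross_y = [30, 41, 39, 52, 47, 55, 60, 58, 71, 80]

# Linear transformations with positive coefficients.
derived_x = [5.8025 * x - 506.7309 for x in gross_x]
derived_y = [2.0 * y + 10.0 for y in gross_y]

print(standardized_moment(gross_x, 3), standardized_moment(derived_x, 3))
print(standardized_moment(gross_x, 4), standardized_moment(derived_x, 4))
print(correlation(gross_x, gross_y), correlation(derived_x, derived_y))
```

Each printed pair should agree except for rounding error in the last 
decimal places. 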


7. Non-linear transformations—percentile ranks 


We shall consider only the two most commonly used non-linear 
transformations, namely, percentile scores and normalized scores. 

A given individual’s percentile score indicates the percentage of per- 
sons in the distribution who score less than that individual. Consider a 
distribution of ten cases, each of which makes a different gross score. 
Each person is considered to occupy one-tenth of the entire percentile 
range from 0 to 100, as illustrated: 

Range occupied:    0-10 | 10-20 | 20-30 | ... | 80-90 | 90-100 
Percentile score:     5 |    15 |    25 | ... |    85 |     95 

The score assigned to each person is the midpoint of the range occupied 
by that person so that, for a distribution of ten persons, the percentile 
scores will be 5, 15, ..., 95, as indicated above. If several different 
persons make the same score, each person’s score is the midpoint of the 
range occupied by all of them. In terms of the foregoing illustration, 
assume that the lowest three persons made the same score and that the 
second and third from the top made the same score. In such a case we 
should have 

Range occupied:    0-30 | 30-40 | 40-50 | 50-60 | 60-70 | 70-90 | 90-100 
Persons:              3 |     1 |     1 |     1 |     1 |     2 |      1 
Percentile score:    15 |    35 |    45 |    55 |    65 |    80 |     95 

three percentile scores of 15 and two of 80, as illustrated. For a distri- 
bution of 100 cases, each person having a different score, the percentile 
scores would begin with 0.5 and proceed by unit steps to 99.5. 
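
The midpoint rule, including the treatment of tied scores, can be 
sketched as follows. The function name and the two lists of gross scores 
are hypothetical; only the rule itself comes from the text. 

```python
from collections import Counter


def midpoint_percentiles(scores):
    """Percentile score for each person: midpoint of the range occupied.

    Persons with the same gross score share the midpoint of the range
    occupied by all of them.
    """
    n = len(scores)
    counts = Counter(scores)
    percentile_for = {}
    below = 0  # number of persons scoring lower than the current score
    for score in sorted(counts):
        tied = counts[score]
        percentile_for[score] = 100.0 * (below + tied / 2) / n
        below += tied
    return [percentile_for[score] for score in scores]


# Ten persons, all with different gross scores.
distinct = midpoint_percentiles([31, 35, 38, 40, 44, 47, 51, 56, 60, 66])

# Lowest three tied, and second and third from the top tied.
tied = midpoint_percentiles([31, 31, 31, 40, 44, 47, 51, 56, 56, 66])

print(distinct)
print(tied)
```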

Let us use the data of Table 1 to illustrate the general procedure for 
calculating a percentile equivalent for the midpoint of each class interval. 
Table 4 illustrates the computation of a percentile corresponding to the 
midpoint of each class interval and to the boundary between the class 
intervals. The midpercentile for a given interval is assigned to all the 
cases in that interval. The procedure is to compute 100/2N, enter this 
figure in the calculating machine keyboard, and multiply by the number 
of cases in the lowest class interval to obtain the midpercentile for that 
interval. Multiplying a second time by the number of cases in the 
lowest class interval gives the percentile corresponding to the upper 
bound of the lowest, and the lower bound of the next class interval. 
We then multiply twice by the frequency in the next class interval, and 
so on until the percentile score 100 is reached. 

In Table 4 this procedure is illustrated for a distribution of 200 cases. 
The quantity 100/2N is 0.25. This quantity is entered in the machine 
and multiplied by 2, giving 0.50, then by 2 again, giving 1.00. Thus the 
percentile score assigned to the lowest two persons is 0.5. Next multiply 
by 3, the frequency in the next class interval, and enter 1.75, the per- 
centile score for the three persons scoring in the 130's; then by 3 again, 
obtaining 2.50 for the boundary between the second and third class 
intervals. This procedure is continued until the final check percentile 
is obtained. The percentile equivalent of the upper bound of the highest 
class interval must be 100.000 ... to as many decimal places as are being 
recorded. In Table 4 the percentile corresponding to the upper and 
lower bound of each class interval has been recorded (the upper bound of 
one class interval being identical with the lower bound of the next higher 
class interval). Since only the midpercentile is used, it is better proce- 
dure to record only the midpercentile and omit the upper and lower 
bounds. They were included to make the computational procedure 
clear. The last three columns of Table 4 are given as a check on the 
number of revolutions that should be recorded in the calculating 
machine at each step. The speediest method of calculating percentiles is 
to follow the procedure indicated in Table 4, recording only the midper- 
centiles and the final check percentile of 100.00 ... 


TABLE 4 

COMPUTING FORM FOR PERCENTILE SCORES 

Gross   | Fre-   | Cumulative | Corresponding p         | Multipliers 
score   | quency | frequency  | Lower |   Mid |  Upper | Lower | Mid | Upper 
120-129 |      2 |          2 |  0.00 |  0.50 |   1.00 |     0 |   2 |     4 
130-139 |      3 |          5 |  1.00 |  1.75 |   2.50 |     4 |   7 |    10 
140-149 |     12 |         17 |  2.50 |  5.50 |   8.50 |    10 |  22 |    34 
150-159 |     23 |         40 |  8.50 | 14.25 |  20.00 |    34 |  57 |    80 
160-169 |     37 |         77 | 20.00 | 29.25 |  38.50 |    80 | 117 |   154 
170-179 |     51 |        128 | 38.50 | 51.25 |  64.00 |   154 | 205 |   256 
180-189 |     39 |        167 | 64.00 | 73.75 |  83.50 |   256 | 295 |   334 
190-199 |     21 |        188 | 83.50 | 88.75 |  94.00 |   334 | 355 |   376 
200-209 |      9 |        197 | 94.00 | 96.25 |  98.50 |   376 | 385 |   394 
210-219 |      2 |        199 | 98.50 | 99.00 |  99.50 |   394 | 396 |   398 
220-229 |      1 |        200 | 99.50 | 99.75 | 100.00 |   398 | 399 |   400 

N = 200 

100/2N = 100/400 = 0.25 
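
The routine of Table 4 can be sketched as a short program. The function 
name is ours; the frequencies are those of Table 4. The running 
multiplier plays the part of the revolutions dial, and the product of 
100/2N by the multiplier plays the part of the product dial. 

```python
def percentile_table(frequencies):
    """Lower, mid, and upper percentile for each class interval.

    Mirrors the machine routine: the quantity 100/2N is multiplied
    twice by each frequency in turn, cumulatively.
    """
    n = sum(frequencies)
    unit = 100 / (2 * n)
    rows = []
    multiplier = 0  # cumulative multiplier at the lower bound
    for f in frequencies:
        lower = multiplier
        mid = lower + f      # after the first multiplication by f
        upper = mid + f      # after the second multiplication by f
        rows.append((unit * lower, unit * mid, unit * upper))
        multiplier = upper
    return rows


frequencies = [2, 3, 12, 23, 37, 51, 39, 21, 9, 2, 1]  # Table 4
rows = percentile_table(frequencies)
midpercentiles = [mid for _, mid, _ in rows]
final_check = rows[-1][2]  # should be 100

print(midpercentiles)
print(final_check)
```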

A routine for computing percentiles that gives a check on the number 
of revolutions in the machine at each step and records only midpercen- 
tiles is shown in Table 5. The columns X and f give scores and frequen- 
cies as before. A zero frequency is added for a hypothetical class inter- 
val below the lowest and above the highest. Column f' gives the sums 
of adjacent entries in column f. The column labeled Σf' is a cumulative 
frequency of the f' column. The entries in column Σf' are identical 
with those in the next to the last column in Table 4, except that the 
check multiplier of 400 (2N) has been added. The quantity 100/2N 
(0.25) is multiplied in turn by each of the entries in column Σf', giving 
the percentiles in the column labeled p. These are identical with the 
midpercentiles of Table 4, except that the final check percentile appears 
at the bottom. 

TABLE 5 


COMPUTING FORM FOR PERCENTILE SCORES 


X               |  f | f' | Σf' |      p 
(below lowest)  |  0 |    |   0 |   0.00 
120-129         |  2 |  2 |   2 |   0.50 
130-139         |  3 |  5 |   7 |   1.75 
140-149         | 12 | 15 |  22 |   5.50 
150-159         | 23 | 35 |  57 |  14.25 
160-169         | 37 | 60 | 117 |  29.25 
170-179         | 51 | 88 | 205 |  51.25 
180-189         | 39 | 90 | 295 |  73.75 
190-199         | 21 | 60 | 355 |  88.75 
200-209         |  9 | 30 | 385 |  96.25 
210-219         |  2 | 11 | 396 |  99.00 
220-229         |  1 |  3 | 399 |  99.75 
(above highest) |  0 |  1 | 400 | 100.00 

N = Σf = 200 

100/2N = 100/400 = 0.25 

Each f' entry is the sum of the f entry in its own row and the f entry in 
the row above. Enter 100/2N in the machine. Multiply cumulatively by 
the entries in f'. The dial indicating number of revolutions will show 
successively the entries in Σf'. Check: The last product should be 100 
to as many decimals as are being recorded. 


Regardless of the original shape of the distribution of gross scores, the 
distribution of percentile scores will be rectangular. Percentile scores 
furnish a convenient method of indicating a person's standing relative 
to a specified group. Such scores are easy to explain to other persons, 
and are felt to be readily understood. Here, however, the advantages 
of percentile scores end. Such scores cannot legitimately be subjected 
to the usual arithmetical operations. For example, if two tests are in- 
volved, and Mr. A has a percentile rating of 60 in one and 70 in the 
other, whereas Mr. B has ratings of 50 and 80, the procedure of averag- 
ing the percentiles would give 65 for both men. Mr. B, however, 
probably would have a higher average if the original gross scores were 
used. Just as average percentiles are misleading, so the correlation co- 
efficients found from using percentiles are different from (usually 
smaller than) those found with gross scores. The amount of drop in 
correlation brought about by changing from gross scores to percentile 
scores in a normal distribution has been discussed by Karl Pearson 
(1907). Pearson indicates that at most the correlation between 
normalized scores is 
.0180 greater than the correlation between percentile scores. 
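
The case of Mr. A and Mr. B can be made concrete under an assumption 
that the text does not state: that gross scores on both tests are 
normally distributed and are averaged in standard deviation units. The 
function name below is ours; `NormalDist` is from the Python standard 
library. 

```python
from statistics import NormalDist


def mean_normal_deviate(percentiles):
    """Average of the normal deviates corresponding to the percentiles.

    Assumes (hypothetically) normal gross score distributions, so that
    each percentile can be converted back to standard deviation units.
    """
    unit_normal = NormalDist()
    deviates = [unit_normal.inv_cdf(p / 100) for p in percentiles]
    return sum(deviates) / len(deviates)


mr_a = [60, 70]
mr_b = [50, 80]

average_percentile_a = sum(mr_a) / len(mr_a)  # 65
average_percentile_b = sum(mr_b) / len(mr_b)  # 65

deviate_a = mean_normal_deviate(mr_a)
deviate_b = mean_normal_deviate(mr_b)

print(average_percentile_a, average_percentile_b)
print(deviate_a, deviate_b)
```

Under these assumptions the average percentiles are equal, whereas the 
average in standard deviation units is higher for Mr. B, because the step 
from the 70th to the 80th percentile covers more of the base line than 
the step from the 50th to the 60th. 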

From the illustrations given, we see that the maximum possible and 
the minimum possible percentile scores are functions of the size of the 
group taking the test. For a distribution of ten cases, these limits are 
95 and 5. For a distribution of a hundred cases, these limits are 99.5 
and 0.5. For distributions of a hundred cases or over, the effect of N 
on percentile scores can usually be ignored. However, normalized 
scores of the very high-scoring and the very low-scoring persons are 
markedly affected by the number of cases in the distribution and by 
slight differences in the extremes of the distribution. These effects will 
be illustrated in the discussion of normalized scores in section 8. 

The most striking defect of percentile scores appears, however, when 
we consider the problems of making norms comparable from group to 
group or test to test. Each percentile score is sensitive to any local 
change in its part of the distribution. Unlike the standard scores, the 
percentile score does not depend upon certain constants characteristic 
of the distribution as a whole. Standard scores, as indicated in equa- 
tion 2, depend upon only two parameters, a mean and a standard devia- 
tion. 

In the equating of percentile scores, no such simple parameters exist. 
Thus we see why it is that, with the growth of testing techniques, the 
percentile score has gradually been abandoned as a basic type of score, 
despite its seeming ease of interpretation. It is frequently convenient, 
however, to supplement linear derived scores with a table of percentiles 
expressed with reference to some specified group to aid in the initial 
interpretation of these scores. 


8. Non-linear transformations—normalized scores 


Since the normal distribution has many convenient properties, and 
since many distributions have been found to be normal or Gaussian 
distributions, another type of score is used in which the frequency dis- 
tribution has been distorted from its original shape into a normal 
distribution. 

After percentile scores are obtained, the normalized scores are ob- 
tained from tables of the normal curve. The base line (usually listed 
in the tables as x or z) value corresponding to each percentile is found. 


Chap. 19] Methods of Standardizing and Equating Scores 281 


Such a set of scores would range from —3 to +3, which is sometimes 
regarded as an undesirable score range. So again, as in the change 
from standard scores to the more general linear derived scores, we may 
multiply the normalized scores by any suitable value to give a standard 
deviation greater than unity, and we may add any suitable value to 
avoid negative scores. 

Like percentile scores, the normalized scores do not duplicate the 
properties of the original gross score distribution. Regardless of the 
skewness and kurtosis of the original distribution, the skewness of the 
normalized scores will be zero and the kurtosis three. However, the 
usual arithmetic operations with scores, such as averaging and calculat- 
ing correlations, are probably legitimate operations to perform with 
normalized scores, as they are not with percentile scores. The problem 
of comparability from test to test and group to group is more difficult 
with normalized than with standard or linear derived scores. Thurstone 
(1925 and 1927b), however, has presented a method for dealing with 
this problem. Flanagan (1939b) has described the use of a system of 
normalized scores by the Cooperative Test Service. 

As in the case of percentile scores, the range of normalized scores 
varies with the number of cases in the distribution. With normalized 
scores this difference is very marked at the extremes of the distribution. 
For a distribution of 10 cases, the percentile score limits are 95 and 5. 
The corresponding normalized score limits are ±1.64. For a distribu- 
tion of 100 cases, the percentile score limits of 99.5 and 0.5 correspond 
to normalized score limits of ±2.58. Also slight differences in grouping 
in the extremes of a distribution, such as might be brought about by 
varying degrees of skewness or kurtosis, will have a very pronounced 
effect upon the extreme normalized scores. For example, in a distribu- 
tion of 200 cases, if one person makes the highest raw score his per- 
centile score is 99.75, and his normalized score is 2.81, as shown in 
Table 6. If five persons of the 200 tie for top score, the percentile score 
for this group is 98.75, which is only one point lower than the score 
obtained by the top one person. However, the normalized score equiv- 
alent is 2.24, or more than half a standard deviation less than 2.81, the 
normalized score for the top ranking one. Such apparently slight 
differences in groupings can make very serious differences in reported 
test results. If normalized scores on different tests are to be compared, 
it is important to be sure that slight differences in groupings in extreme 
cases do not occur, and also to be certain that the groups are similar in 
size; otherwise the results reported for normalized scores will be influ- 
enced more by the size of the group and by slight differences in grouping 
in the extremes than by the abilities of the students. 
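
The dependence of the extreme normalized score on group size and on 
ties at the top can be sketched as follows. The function name and its 
arguments are ours; the normal curve table is replaced here by the 
standard library's `NormalDist`. 

```python
from statistics import NormalDist


def top_normalized_score(n_cases, n_tied_at_top=1):
    """Normalized score of the persons holding the highest gross score.

    The percentile score is the midpoint of the range occupied by the
    tied persons; the normalized score is the matching normal deviate.
    """
    proportion = (n_cases - n_tied_at_top / 2) / n_cases
    return NormalDist().inv_cdf(proportion)


limit_10 = round(top_normalized_score(10), 2)        # 10 cases
limit_100 = round(top_normalized_score(100), 2)      # 100 cases
single_top = round(top_normalized_score(200), 2)     # 200 cases, 1 at top
five_tied = round(top_normalized_score(200, 5), 2)   # 200 cases, 5 tied

print(limit_10, limit_100, single_top, five_tied)
```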
The computation of normalized scores is illustrated in Table 6. First, 
we compute the percentile score equivalents as illustrated in Table 5. 
Then from a table of the normal curve we read the base line values 
(normalized scores) that correspond to the various areas under the 
curve (that is, the percentile scores). 


TABLE 6 


A WORKSHEET FOR RECORDING NORMALIZED SCORES 


X       |  f |     p |     n 
120-129 |  2 |  0.50 | -2.58 
130-139 |  3 |  1.75 | -2.11 
140-149 | 12 |  5.50 | -1.60 
150-159 | 23 | 14.25 | -1.07 
160-169 | 37 | 29.25 | -0.55 
170-179 | 51 | 51.25 | +0.03 
180-189 | 39 | 73.75 | +0.64 
190-199 | 21 | 88.75 | +1.21 
200-209 |  9 | 96.25 | +1.78 
210-219 |  2 | 99.00 | +2.33 
220-229 |  1 | 99.75 | +2.81 


Columns X, f, and p are taken from Table 5. Column n gives the normalized 
score; n is read from a table of the normal curve by entering it with the 
value p or 1 - p (expressed as a proportion). 
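
Where a table of the normal curve is not at hand, column n of Table 6 
can be obtained by computation. The sketch below takes the 
midpercentiles of Table 5 and uses the standard library's inverse 
normal function in place of the table; the variable names are ours. 

```python
from statistics import NormalDist

# Midpercentiles p from Table 5, one per class interval.
midpercentiles = [0.50, 1.75, 5.50, 14.25, 29.25, 51.25,
                  73.75, 88.75, 96.25, 99.00, 99.75]

unit_normal = NormalDist()  # mean 0, standard deviation 1

# Base line value whose cumulative area equals p/100, to two decimals.
normalized = [round(unit_normal.inv_cdf(p / 100), 2) for p in midpercentiles]

for p, n in zip(midpercentiles, normalized):
    print(f"{p:6.2f}  {n:+.2f}")
```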

The use of normalized scores is indicated if there is reason to believe 
that the ability measured by the test is normally distributed and that 
defects in the test make the distribution of gross scores non-normal. 
Normalized scores on different tests are not comparable unless the 
groups are of similar size, and the distribution of extreme scores is simi- 
lar in both distributions. 


9. Standardizing to indicate relation to a selected standard 
group—McCall’s T-score; Cooperative Test Scaled Scores 


In order to give a common reference point for various scores, it has 
been suggested that some standard group be chosen and carefully de- 
fined, and that then the scores of all individuals be referred to that 
group regardless of whether or not the individual is a member of that 
group. 

For example, McCall (1922) suggested that a normalized scale for 
general use in standardized tests be based on scores of 12-year-old chil- 
dren. He suggested that the mean normalized score for 12-year-olds be 
called 50, and the standard deviation of the normalized scores for 12- 
year-olds be fixed at 10. He suggested that such scores might be called 
T-scores (in honor of Thorndike and Terman), and that all individuals 
might be scored on this scale regardless of whether or not they were 
12 years old. McCall suggests the use of normalized scores, but he does 
not explain how we are to find out what gross score corresponds to very 
extreme normalized scores, such, for example, as plus or minus five or 
six standard deviations away from the mean. This difficulty in extra- 
polation has been overcome in subsequent expositions of the T-score 
by making it a standardized score or a linear derived score with mean 
50 and standard deviation 10 based on a group of 12-year-old children. 
Although this change in McCall’s original idea (see Hull, 1928, pages 
166-171, for example) makes it possible to extend the scale somewhat 
farther than the range of the original group, it still is rather a meaning- 
less standardization to include in a test for 12-year-olds items suitable 
for first-grade children that will be answered correctly by all the 12-year- 
old group, and items suitable for college students that will be answered 
correctly by essentially none of the 12-year-old group. The only usable 
type of solution for such a problem seems to lie in the devising of methods 
for putting several different groups on the same scale. Thurstone’s 
absolute scaling methods and the Cooperative Test Service system of 
Scaled Scores illustrate such methods. 

Thorndike suggested that successive groups be normalized on the 
same scale. He suggested making allowance only for differences in the 
means of the various groups. If the normalized score of one group is 
designated by X and the normalized score of another group by Y, 
Thorndike’s method amounts to equating the groups by using only the 
assumption that X = Y + C. The score when related to the X-group 
will differ by a constant from the score related to the Y-group. He 
assumed that the means of the groups differed but that the different 
groups each had the same standard deviation. In using this method to 
standardize items, it was found that scale values of items varied syste- 
matically from one group to another. Thurstone suggested that more 
freedom be allowed in equating the groups. His suggestion was that 
all the groups be assumed to be normally distributed on the same base 
line, but that it be assumed that the means and standard deviations of 
the different distributions might be different. Thurstone’s method of 
absolute scaling based on this assumption has been found to give con- 
sistent results in several instances in which it has been used. Gardner 
(1947), working with Rulon and Kelley at Harvard, has suggested that 
another degree of freedom be allowed in trying to match several different 
distributions to the same base line. He has assumed that the groups 
may differ in mean, in standard deviation, and in skewness. That is, 
the distributions need not have the zero skew characteristic of the nor- 
mal distribution. Gardner has used this method in analyzing score 
distributions for tests given at various grade levels, and has found 
definite skewness differences from grade to grade. 

The Scaled Scores of the Cooperative Test Service are similar to the 
absolute scaling units Thurstone has suggested, in that the different 
groups used are assumed to be normally distributed with different 
means and standard deviations, on the same basic scale. The Coopera- 
tive Test Service Scaled Scores are based on the performance of a group 
of average white children in the United States at the completion of a 
particular course in a typical school with the usual instruction in that 
subject. Such a group is assumed to be normally distributed with a 
mean of 50 and a standard deviation of 10. It is clear that, in selecting 
cases for such a standardization, there must be a number of somewhat 
arbitrary decisions and assumptions. Thurstone made no suggestions 
regarding any arbitrary value for a mean and a standard deviation. He 
pointed out that the standard deviation of some selected group could be 
termed unity or ten, and that a zero point could be chosen three to five 
standard deviations below the mean of the lowest group. 

A system for normalizing several distributions on the same base line 
that is rigorous and complete with significance tests and confidence 
intervals has not yet been devised. The procedure described by Flan- 
agan (1939b) is an iterative one and uses only the points corresponding 
to the median score of each of the distributions considered. Since 
Thurstone’s procedure is simple and direct, requiring no successive 
approximations, we shall describe it here. Flanagan (1939b) has de- 
scribed the Cooperative test procedure and has worked out an illustra- 
tive example with both his own and Thurstone's method. 

In his bulletin on Scaled Scores, Flanagan (1939b) indicates that it 
was Kelley who suggested 50 as the mean for the average child, subject 
to the average training. The concept was developed in connection with 
Kelley's unpublished Universal Grading System. 


10. Thurstone’s absolute scaling methods for gross scores 


Thurstone’s absolute scaling procedure as applied to test scores 
involves the following steps. 


1. Give the test to two or more groups, so that there will be a marked 
overlap in the distributions of adjacent groups. We shall illustrate 
with two such groups, a and b. 

2. Select ten or twenty gross score points ($X_i$), so that percentile 
scores (and hence normalized scores) can be determined for both 
groups a and b. 

3. Determine the normalized scores ($Y_{ia}$ and $Y_{ib}$) for groups a and b 
corresponding to each of the selected gross score points $X_i$. 

4. Plot $Y_{ia}$ on the ordinate against $Y_{ib}$ on the abscissa for these 
selected points. 

5. If the two groups can each be normalized on the same base line, 
this plot will be linear. 


In order to show this, let us assume a basic scale of values $V_i$, in 
terms of which both groups are measured. If $M_{Va}$ and $s_{Va}$ designate 
the mean and the standard deviation of the a-group in these standard 
units, any given score $V_i$ may be expressed in terms of this mean and 
standard deviation as 

(7)    $V_i = M_{Va} + Y_{ia}s_{Va},$ 

where $Y_{ia}$, the normalized score with respect to the a-distribution, indi- 
cates the number of standard deviation steps between the mean ($M_{Va}$) 
and the score ($V_i$). Such an equation, with a different value of $V_i$ and 
$Y_{ia}$, and the same value of $M_{Va}$ and $s_{Va}$, applies for each of the gross 
score points selected for the comparison. Similarly, each of these points 
may be referred to distribution b instead of distribution a, and another 
set of equations written. These equations are 

(8)    $V_i = M_{Vb} + Y_{ib}s_{Vb}.$ 

Equating these for successive values of $V_i$, we have 

(9)    $M_{Va} + Y_{ia}s_{Va} = M_{Vb} + Y_{ib}s_{Vb},$ 

where $M_{Va}$ and $M_{Vb}$ are the means in hypothetical absolute units for 
distributions a and b, 
$s_{Va}$ and $s_{Vb}$ are the standard deviations in hypothetical abso- 
lute units for distributions a and b, and 
$Y_{ia}$ and $Y_{ib}$ are the normalized scores for distributions a and b, 
respectively. 


This fundamental equation as applied to test scores is given by 
Thurstone (1938). Since the M’s and s’s are constant regardless of the 
varying values of the Y’s, we have a linear relationship between $Y_{ia}$ 
and $Y_{ib}$ that may be written 

(10)    $Y_{ia} = \left(\frac{s_{Vb}}{s_{Va}}\right) Y_{ib} + \frac{M_{Vb} - M_{Va}}{s_{Va}}.$ 


That is, if it is possible to normalize both the a and the b distributions 
on the same base line, by assuming only different means and standard 
deviations, the plot of the normalized scores ($Y_{ia}$) against the normal- 
ized scores ($Y_{ib}$) will be linear with a slope equal to the ratio of the 
standard deviations, and an intercept equal to the difference of means 
divided by one of the standard deviations. If one mean and one stand- 
ard deviation are known, it is possible to solve for the other mean and 
standard deviation. Thus, by assuming a mean and standard devia- 
tion for the a group, the mean and standard deviation of the b group 
can be computed. Then this computed mean and standard deviation 
for the b group can be used in the b-c comparison to solve for the mean 
and the standard deviation of c. 
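
The solution for the mean and standard deviation of the b group may be 
sketched as follows. Two points are our own assumptions rather than part 
of the procedure described: the line is fitted by least squares instead 
of being drawn through the plotted points, and the normalized scores 
used as data are hypothetical values constructed to lie exactly on a 
line. All function and variable names are ours. 

```python
from statistics import fmean


def fit_line(y_b, y_a):
    """Least-squares line y_a = slope * y_b + intercept (an assumption;
    the text reads the line from a plot)."""
    mean_b, mean_a = fmean(y_b), fmean(y_a)
    sum_bb = sum((b - mean_b) ** 2 for b in y_b)
    sum_ba = sum((b - mean_b) * (a - mean_a) for b, a in zip(y_b, y_a))
    slope = sum_ba / sum_bb
    return slope, mean_a - slope * mean_b


def scale_group_b(y_a, y_b, mean_va=0.0, sd_va=1.0):
    """Mean and standard deviation of group b on the base line of group a.

    From equation 10: slope = s_Vb / s_Va and
    intercept = (M_Vb - M_Va) / s_Va.
    """
    slope, intercept = fit_line(y_b, y_a)
    sd_vb = slope * sd_va
    mean_vb = mean_va + intercept * sd_va
    return mean_vb, sd_vb


# Hypothetical data: six gross score points with base line values V,
# group a having mean 0 and s.d. 1, group b mean 1 and s.d. 1.25.
base_values = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
y_a = [(v - 0.0) / 1.0 for v in base_values]
y_b = [(v - 1.0) / 1.25 for v in base_values]

mean_vb, sd_vb = scale_group_b(y_a, y_b)
print(mean_vb, sd_vb)
```

With real data the points will scatter about the line, and the fitted 
values are only as trustworthy as the linearity of the plot. 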

It must be noted, however, that the entire process of equating the 
scores of the various distributions is dependent upon the linearity of 
the plot of $Y_{ia}$ against $Y_{ib}$. If this plot deviates markedly from the 
linear, we must conclude that both these groups cannot be normal on 
the same base line. The equating procedure is indicated to be impos- 
sible on these assumptions, and cannot be carried out legitimately. At 
present we do not have significance tests to determine when the pro- 
cedure is legitimate, and when not. It is necessary to use judgment 
regarding the seriousness of deviation from linearity until significance 
tests are developed. 

Thurstone (1938) has applied this method and has shown that a 
national distribution of 40,229 A.C.E. scores, the distribution of 646 
University of Chicago freshmen, and of 113 test subjects volunteering 
for the primary mental abilities battery may all be regarded as normal 
on the same base line. 

The absolute scaling technique makes it possible to plot the frequency 
distributions of many different groups on the same base line. With 
units so established, it would be possible to indicate something about 
the nature of the mental growth curve for different types of mental 
functions. Thurstone has applied such scaling methods to Binet test 
items from different ages, and he finds a mental growth curve that is 
slightly negatively accelerated although it is still rising rapidly at age 
14; see Thurstone (1925). He also applied the same absolute scaling 
method to the completion test data collected by Trabue (Thurstone, 
19270), and he found a growth curve with only a very slight negative 
acceleration that was still rising very rapidly at grade 12. 


11. Standardizing to indicate age or grade placement 


One of the methods currently much in use for scaling of test scores is 
to express the results in terms of the subject’s standing with respect to 
one of several possible standard groups. Mental Age units and Educa- 
tional Age units are examples of this type of scaling. In the case of 
Mental Age units, the individual is given a score that represents the 
“age group to which he belongs on account of his test score.” Simi- 
larly, the Educational Age units are used to indicate the grade group 
that the individual resembles. Thurstone (1926) has shown the unsat- 
isfactory nature of the Mental Age unit as well as the ambiguity of 
definition of such a unit. 

An Educational Age of 8, for example, may be assigned to the average 
score made by all eighth-grade students; or it may be assigned to a 
score X;, which is selected so that the eighth grade is the average grade 
of persons who make that score. These two definitions will not lead to 
the same set of norms. The first definition corresponds to the regres- 
sion of test score on grade placement, and the second corresponds to the 
regression of grade placement on test score. 

In order to interpret such age or grade norms, we must know which 
regression line has been used, and we must also know the amount of 
variation about that regression line. For example, suppose that grade 
norms are established on the basis of the regression of score on grade. 
Then a person who has a grade placement score of 8, for example, has 
made a score equal to the average of scores made by a representative 
group of eighth-grade students. Suppose we know that this student is 
in the sixth grade; it is then possible to say that he is two years advanced, 
in the sense that if he were put with a group of eighth graders he would 
score at the average of that group. However, we do not learn from such 
information alone how usual or unusual such a performance is. If such 
a student is at the 95th percentile on sixth-grade norms, we know that 
only 5 per cent of sixth-grade students are two years or more advanced. 
However, if this point is at the 80th percentile on the sixth-grade norms, 
we know that there is a great deal of overlap between the successive 
grades, so much in fact that 20 per cent of students in the sixth grade 
are at or above the score made by the average eighth grader. 

The same type of remarks apply to any other set of norms based on 
successive groups, whether they are age groups, grade groups, height 
groups, or some other type of grouping. To know that a person is two 
or three years advanced or retarded in a given characteristic becomes 
much more meaningful if we also know the percentage of his group that 
is advanced or retarded an equal or greater amount. 

Similar considerations apply if the other regression line is used. Sup- 
pose, for example, that we are using the regression of chronological age 
on test score. Then the age equivalent would be the average age of 
persons making the same score. Suppose that the average age of per- 
sons making a given score is ten, and that the student whose score is 
being interpreted is eight years old. We know that he is with a group 
that is on the average two years older than he is. Again if only 5 per 
cent of the students making that score are under eight years of age, we 
are dealing with a relatively unusual degree of advancement. How- 
ever, if 25 per cent of the students making that score are under eight, 
we are dealing with a more common degree of advancement. 

Figure 1 illustrates the difference between these two modes of proce- 
dure. Line A is the regression of score on age, the average score made 
by persons of each chronological age. According to this line a score of 
$X_1$ is equivalent to an age level of b years, whereas a score of $X_2$ is 
equivalent to an age level of c years. On the other hand, the line B is 
the regression of age on score, the average age of those persons making 
a given score. According to this line, the age level corresponding to a 
score of $X_1$ is a years, and the age level corresponding [...] is b years. 
It will be noticed that, for all [...] regression of age on score will give 
lower [...] score level than will the regression of score [...] the mean, 
the regression of age on score [...] than does the regression of score 
[...] “age difference” corresponding to any two [...] 


FIGURE 1. Illustrating the difference between the regression of score 
on age, and age on score. 
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It is interesting to note that, if the regression of age on score is used, 
as tests become more unreliable children above the average will be re- 
ported as less advanced than they are, and children below the average 
will be reported as less retarded than they are. If the regression of
score on age is used, children who are above average are reported as
very markedly advanced, and children below average are reported as
very markedly retarded.

This effect is demonstrated in Figure 2. Line A is the regression of
test score on age, and line B the regression of age on test score for a
very reliable test that correlates highly with age. Since lines A and B
are close together, it makes little difference which regression line is
used when the correlation between score and age is high. If the test is
shortened and becomes unreliable, line A will tend to move into a
position such as line a, and B will move toward line b. Line a then
represents the regression of score on age for a relatively unreliable
test, one that does not correlate very highly with age. Line b
represents the regression of age on score for such a test.

Figure 2. Showing the effect of test unreliability and regression line
used, upon norms. (Axes: age; test score.)

By taking any illustrative score level above the mean, we see that 
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such a child will appear more advanced if the regression of score on age 
is used, and less advanced if the regression of age on score is used. In a
similar manner, for any score less than average, the unreliable test will
minimize the degree of retardation if the regression of age on score is
used, and exaggerate it if the regression of score on age is used. The
only method of taking account of such effects is to report the error of 
measurement of the test and the variability about the regression line 
which is used. Once this is done, the difference between the test repre- 
sented by lines A and B and the unreliable test represented by the dotted 
lines a and b becomes apparent in the norms. 
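
The contrast between the two regression lines can be sketched
numerically. The following is a minimal illustration, not part of the
original treatment: it assumes linear regressions, and the function
name and the means, standard deviations, and correlations used are
hypothetical values chosen only for the example.

```python
def age_equivalents(score, mean_score, sd_score, mean_age, sd_age, r):
    """Return two age equivalents for one test score.

    First value: the score-on-age line read backwards (the age whose
    average score equals the given score).
    Second value: the age-on-score line (average age of persons making
    the given score).
    """
    z_score = (score - mean_score) / sd_score
    # Score on age: z_score = r * z_age, so z_age = z_score / r.
    from_score_on_age = mean_age + sd_age * z_score / r
    # Age on score: z_age = r * z_score.
    from_age_on_score = mean_age + sd_age * r * z_score
    return from_score_on_age, from_age_on_score


# Hypothetical norms: mean score 50 (SD 10), mean age 10 years (SD 2).
# A score one standard deviation above the mean is interpreted with a
# test that correlates highly with age (r = 0.9) and with one that
# correlates poorly with age (r = 0.5).
for r in (0.9, 0.5):
    print(r, age_equivalents(60, 50, 10, 10.0, 2.0, r))
```

With the lower correlation the first age equivalent moves further from
the mean age and the second moves closer to it, which is the divergence
of lines a and b described above.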

Whenever a test is reported in terms of many different standard
groups, as in the case of age norms or grade norms, it is essential to
know:

1. Which regression line is used.

2. The variability of the standardization group about that regression
line.

3. The error of measurement of the test.

Unless we have this information it is impossible to estimate the degree
to which two or three years retardation or advancement indicates a
marked deviation from normal performance or a marked unreliability
in the test.

To illustrate the same sort of logic with conventional height-weight
norms, we may point out that such norms are usually constructed to
give the regression of weight on height. That is, to use the norms, first
find your height, and then note the average weight for persons of your
height. Such information is of value in that it tells you how many
pounds it is necessary to gain or lose in order to be of average
[weight for] your height. Since it is not as easy [...] showing the
average height for persons [...] useful item of information.

[...] how usual or unusual [...] is for your height. Some percentile
tables [...] to each person that he was within the weight [...] persons
of his height, or was heavier or lighter [...] of persons of his
height. Such added information [...]

[...] Age was divided by [...] quotient, known as the Intelligence
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Quotient or the I.Q. Similarly, the grade placement indicated by the 
test score was called Educational Age; and the Educational Age was 
divided by chronological age to obtain an educational quotient or E.Q. 
It was also suggested that one of these quotients be divided by the other 
to determine an accomplishment quotient or A.Q. 

Since we need the error of estimate and the error of measurement in 
order to make any reasonable interpretation of norms such as Mental 
Age or Educational Age, it would seem clear that further routine divi- 
sion would only make the scores more and more difficult to interpret. 
As Thurstone (1926) has pointed out, the best procedure is to abandon 
the various quotient type units, as well as the Mental or Educational 
Age units, and to use normalized or standard score type units referring a 
given case to several different sets of norms if necessary. We could then 
say that this eight-year-old child has a percentile score of 92 on eight- 
year norms, and one of 50 on the eleven-year norms. Such a system 
would reflect both the typicality or atypicality of the child, and the 
rate of advancement of the group in the trait or skill in question. 

It should also be noted that the relationship between different norms 
is changed by social customs. For example, the relationship between 
age and grade norms is affected by changes in the educational customs 
regarding promotion from grade to grade. In the early 1900's promotion 
was based primarily on achievement. The pupil who did not learn as 
rapidly as the average was not promoted. Such an educational system 
would give rise to a marked difference between age and grade norms, 
and also lead to a smaller dispersion of scores within each grade, accom- 
panied by less overlap in the scores of adjacent grades. The present 
custom of promoting a pupil primarily on the basis of age will increase 
the resemblance between age and grade norms, increase the dispersion 
of scores within a given grade, and produce a marked overlap in the 
scores of adjacent grades. Norms that were determined under the
former system of promotion cannot be compared with norms estab- 
lished under the present system of promotion primarily on the basis of 
age. Similarly, norms that have been established under limited educa- 
tional opportunities, and when the illiteracy rate is high, cannot be 
expected to resemble norms established when the educational level of 
the population is increased, and the illiteracy rate is low. 

12. Standardizing to predict criterion performance 

If we are dealing with a situation where predicting a criterion per- 
formance is desired, the proper regression line is readily indicated. For 
this purpose the regression of criterion score on test score is the correct 
one to use and will give the best predictions in the sense that this line
[represents the] average criterion score obtained by the persons making
each [test score]. The regression of test on criterion score will
systematically give overpredictions of the performance of those scoring
above the mean, and underpredictions of the performance of those
scoring below the mean.

In using either regression line, we must note the possible effects of a
change in the population. For example, if we have established a
regression line and a cutting score at level [...] the high-level
applicants are [...] necessary to raise the cutting [score ...].
Conversely, if a depression throws a [...] if the old cutting score
($X_1$) were maintained, it would be expected that the quality of work
obtained from the selected applicants would be increased.

[Figure: regression line (vertical axis: criterion) [...] selection of
the group on some variable related to criterion performance.]

In summary, then, if we are using test scores to predict a criterion 
score, and wish to set a cutting score such that the average criterion 
score of those at the cutting score will be some fixed value, then if there 
is a shortage of high-scoring and a surplus of low-scoring persons, it is
necessary to raise the cutting score; whereas, if there is a surplus of
high-scoring persons, or a shortage of low-scoring persons, it is necessary 
to lower the cutting score in order to have the average criterion per- 
formance of those at the cutting score remain at a specified level. That 
is, the adjustment required to maintain a given level of performance at 
the cutting score is the opposite of what we should wish. 

In part the decision for the use of the regression of x on y or of y on x 
may depend upon which variable is made the basis of selection of cases 
for the standardization group. For example, if there is reason to be- 
lieve that we have a representative sample of eight-year-old children 
or nine-year-old, ten-year-old, etc., we might use the regression of score 
on age and expect the regression of score on age found for that group to 
be duplicated in future samples. The regression of age on score (aver- 
age age of those making a given score) is indicated if we feel that the 
sample drawn is representative of all ages making a given score. That 
is, if there have been no influences at work that would select with respect 
to age of the population, the regression of age on score is indicated.

If we wish to use the regression of criterion on test score, the group
may be selected explicitly on the basis of test score without biassing the
regression line, but within the test score range selected there must be no
selection on the basis of criterion score. For example, if workers who
do not show a certain minimum production record are dismissed, and
hence not included in the standardization group, the regression of
criterion on test score will not be correct. We may select on the basis
of the independent variable without biassing the regression of the
dependent on the independent variable. There must be no selection on
the basis of scores on the dependent variable or the regression line will
not be correct.


13. Marginal performance as a guide in determining cutting score

In determining cutting score or in deciding on possible changes in a
cutting score, it is sometimes helpful to consider the performance of the
"marginal" student, that is, the student immediately above or below the
proposed cutting score. Figure 4 illustrates a correlation scatter plot
of criterion versus test score, with the regression of criterion on test
score ($Y = aX + \bar{Y} - a\bar{X}$), and a critical level ($L$) for the
criterion score. Let us consider any given test score array ($C'$) as
illustrated in Figure 4.

Figure 4. Critical criterion level and regression of criterion on test
score. (Axes: test score, X; criterion, Y. Marked: regression of
criterion on test score, array C', critical criterion level.)

The mean criterion score of this array is on the regression line, and it
may be written as

$$aX_i + \bar{Y} - a\bar{X}.$$

The standard deviation of this array (of any array) is the standard
error of estimate,

$$s_y\sqrt{1 - r_{xy}^2}.$$

For any given array, the critical criterion level $L$ may be written as
the deviation score,

$$L - aX_i - \bar{Y} + a\bar{X}.$$

Or, written as a standard score, we have

$$z_{L_i} = \frac{L - aX_i - \bar{Y} + a\bar{X}}{s_y\sqrt{1 - r_{xy}^2}},$$

where $z_{L_i}$ is the deviation of the cutting score ($L$) from the mean of
array ($i$), using the standard deviation of the array or the error of
estimate as the standard unit.

This quantity $z_{L_i}$ may be computed for each of the possible test score
values ($X_i$). These values may be converted into percentages by the
use of a table of the normal curve. This series of percentages will show
the percentage of persons that will be above (or below) the critical
criterion level for each test score ($X_i$). Figure 5 is such a graph,
showing $p$, the percentage above the critical criterion score for each
value of $X_i$. A cutting score just below F would mean that the lowest
persons accepted would have a 50-50 chance of being above the critical
criterion level (see F in Figure 4 or 5). As the cutting score is moved
below this point, the lowest persons accepted have a better than even
chance of being below the critical criterion level. If the cutting score
is fixed at a level considerably above F, persons with a better than even
chance of being above the critical criterion level are being rejected.
The decision to move the cutting score away from the point F depends on
judging either that the need for additional persons is sufficiently
urgent to justify accepting those who are more likely to fail than to
qualify, or that we can afford to reject a group that has a better than
even chance of success in order to reduce the total number of failures.


Figure 5. Percentage above critical criterion level as a function of
test score. (Horizontal axis: test score, X.)


This approach to selecting a cutting score can be quantified if it is
possible to determine or estimate the ratio of the two quantities: $H$, the
cost of selecting a person who will fail, and $G$, the net gain from
selecting a person who will qualify. The cutting score should be at the
point where for the marginal group the total gain from the successes will
equal the total cost from the failures, or where

$$pG = (1 - p)H.$$

Solving for $p$, we find that

$$p = \frac{H}{G + H}.$$

For example, if $G = H$, then, according to this equation, $p = 1/2$. In
such a case, the cutting score would be at the point F in Figures 4 and 5,
where the probability of success or failure is 1/2. If the ratio $R = H/G$
can be approximated, we have a solution as

$$p = \frac{R}{R + 1}.$$
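
The procedure of this section can be sketched in a few lines. The
sketch below is only an illustration under stated assumptions: arrays
are taken to be normal about the regression line with the standard
error of estimate as their standard deviation, the normal-curve table
is replaced by `statistics.NormalDist` from the Python standard
library, and the function names and all numerical values are
hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def percent_above(x, a, mean_x, mean_y, sd_y, r, critical):
    """Proportion of the array at test score x above the critical level."""
    array_mean = a * x + mean_y - a * mean_x      # point on the regression line
    se_estimate = sd_y * sqrt(1 - r * r)          # standard error of estimate
    z = (critical - array_mean) / se_estimate     # standard score of L in array
    return 1.0 - NormalDist().cdf(z)


def required_p(cost_h, gain_g):
    """Marginal proportion of successes at which gain equals cost."""
    ratio = cost_h / gain_g                       # R = H / G
    return ratio / (ratio + 1)


def cutting_score(scores, p_needed, **line):
    """Lowest test score whose array reaches the required proportion."""
    for x in sorted(scores):
        if percent_above(x, **line) >= p_needed:
            return x
    return None


# Hypothetical regression line and critical level (illustrative values).
line = dict(a=0.5, mean_x=8.0, mean_y=10.0, sd_y=2.0, r=0.5, critical=10.0)
print(cutting_score(range(3, 14), required_p(1.0, 1.0), **line))
print(cutting_score(range(3, 14), required_p(3.0, 1.0), **line))
```

When the cost of a failure is made three times the gain from a success,
the required proportion rises from 1/2 to 3/4 and the cutting score
moves upward accordingly.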


14. Equating two forms of a test by giving them to the same group

It is usually thought that when two forms of a test are given to the
same group, no special equating problems arise. The procedure is to
convert each form directly to standard, normalized, or percentile scores,
and to assume that such scores are comparable since they were obtained
on the same group for both forms of the test. It should be noted,
however, that a conversion to standard score makes adjustments only for
differences in the mean and the standard deviation of the two forms.
If the skewness or the kurtosis of the two forms differs, this difference
will be reflected in the standard scores and will also be reflected in most
cases in percentile or in normalized scores.

For example, if a distribution of scores is negatively skewed, there
will be some very low scores, but there will not be corresponding very
high scores. This will [...] it will be true of standard [...] We shall
get a few scores that [...] scores that are corresponding [...]

This effect of grouping [...] considering the five top [...]
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subjects make different scores, they will have percentile scores of 99.75, 
99.25, 98.75, 98.25, and 97.75. If all these subjects make the same 
score, all of them will get a percentile score of 98.75. Such a test cannot 
discriminate between the 97.75 and the 99.75 performance. There is no 
opportunity for even the best person to score higher than 98.75. With 
normalized scores this difference is even greater, since in the first in- 
stance the highest possible percentile score of 99.75 corresponds to a 
highest possible normalized score of 2.81, whereas, in the second case, 
the highest possible percentile score of 98.75 corresponds to a normalized 
score of 2.24. This is a normalized score difference of 0.57. In the cen- 
tral part of the distribution it would take a very much larger percentile 
difference to correspond to a normalized score difference of 0.57. For 
example, a percentile score of 50.00 corresponds to a normalized score 
of 0.00, and a percentile score of 71.57 to a normalized score of 0.57. 
The difference between having one or five persons grouped at the highest 
score changes the highest possible normalized score by as much as the 
difference between the fiftieth and the seventy-first percentile. 
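
The figures in this example can be checked with a short computation.
This is a sketch under two assumptions that are not stated in the text
as it stands: that percentile scores are taken at the midpoint of the
persons tied at a score, and that the group contains 200 persons (the
size that reproduces the percentile scores quoted). The function names
are illustrative.

```python
from statistics import NormalDist


def midpoint_percentile(n_below, n_tied, n_total):
    """Percentile score at the midpoint of the persons tied at one score."""
    return 100.0 * (n_below + n_tied / 2.0) / n_total


def normalized_score(percentile):
    """Normal deviate corresponding to a percentile score."""
    return NormalDist().inv_cdf(percentile / 100.0)


# Top person alone at the highest score, versus five persons tied there.
alone = midpoint_percentile(199, 1, 200)
tied = midpoint_percentile(195, 5, 200)
print(alone, round(normalized_score(alone), 2))
print(tied, round(normalized_score(tied), 2))

# A comparable normalized-score difference near the middle of the scale.
print(round(normalized_score(71.57) - normalized_score(50.0), 2))
```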

If two tests have skewness and/or kurtosis coefficients that are 
radically different, it is difficult to define the meaning of parallel scores 
on the two tests. No set of standard, linear derived scores, percentile, 
or normalized scores will be parallel.

As yet there is no statistical test available for equality of skewness 
and kurtosis. However, by inspecting the cases at the extremes of the 
distributions involved, it is possible to compare the highest possible and 
the lowest possible scores in two distributions, and to judge whether or 
not the difference is serious in terms of the decisions that are being 
made on the basis of the scores. In particular it is necessary to be care- 
ful when special action is being taken on the basis of extremely high or 
extremely low scores, as, for instance, if the best student is awarded a
scholarship or if a few especially low students are dismissed. 

If three or more parallel forms are being standardized on the same
group of persons, Wilks’ test for equality of variances and covariances
given in Chapter 14 may be used. If the tests are not homogeneous with
respect to covariances, no adjustment of norms can make the forms
parallel in this respect.

A second type of case arises if we are standardizing two forms of a
test on the same group, and a criterion, which the test is to predict, is
also available. In such a case, we may define the problem of equating
test scores as matching the regression line of criterion on test for the
two tests. Let us use the subscript $c$ to designate the criterion, and $x$
and $y$ to designate the two tests. Then the problem of equating $x$ and $y$
for the purpose of predicting criterion $c$ may be stated as follows.

Step 1. Check to see that $r_{yc} = r_{xc}$, or that
$(1 - r_{yc}^2)s_c^2 = (1 - r_{xc}^2)s_c^2$. This check may be made on an
approximate judgmental basis or by using the extension of Wilks’
criterion given by Votaw (1947) or (1948). If the criterion correlations
of the two tests (or what is the same thing, the criterion errors of
estimate) are essentially the same, it is possible to equate scores on
the two tests, $x$ and $y$. If the criterion correlations are different,
equating scores for the purpose of predicting the criterion is not
possible.

Step 2. Express both $x$ and $y$ in terms of standard scores or linear
derived scores, using the same mean and standard deviation. If there
are slight differences in the criterion correlations of $x$ and $y$, which may
be attributed to sampling errors, it is possible to match the regression
lines exactly by setting the mean of $x$ equal to the mean of $y$ [and]
the ratio of the standard deviations equal to the ratio of [the
criterion] correlations (that is, $s_x/s_y = r_{xc}/r_{yc}$).

Kelley (1947), pages 364-365, describes a method [...] norms for a new
test ($X_0$) in terms of an "anchor test" [...] on the use of the
regression of $X_0$ on [the anchor test]. [...] determine equivalent
scores for two parallel [...] bias will result. As compared with the
anchor test, the new test will have a smaller unit of measurement, and
hence the numerical value of its standard deviation will necessarily be
larger than that of the anchor test.

In summary, when two forms of [...] converting each form directly to
[...]

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15. Equating two forms of a test given to different groups 

A more complex and also more usual case of equating arises when 
form Y (given to group A) is to be equated to form Z (given to group B). 
This is usually done by means of another test or segment of a test that 
will be designated X, which is administered to both groups. The theory 
for equating test Y given to group A with test Z given to group B by 
means of test X given to both groups has been developed by Ledyard 
Tucker (unpublished manuscript) for the case of standardized or linear 
derived scores. 

The equating “test” X mentioned above may be a single test or subtest,
yielding only one score, or it may be that several equating variables
($X_g$, $g = 1 \cdots K$) will be available. We shall first consider the case
where only one equating variable is available ($g = 1$), and then the more
general case where $K$ equating variables are available.

Since standard scores or linear derived scores are dependent entirely 
on mean and variance, the problem may be stated as that of estimating 
the mean and variance of test Y for group B. This mean and variance 
would then be arbitrarily assigned to the new test Z that was given to 
group B. The Z-norms would thus be comparable to the Y-norms in 
the sense that the mean and variance of the transformed Z-scores for 
group B would be the same as they would have been if test Y had been 
used on group B. 

Let us use a subscript set in roman type to designate the group, a
bar over the variable to designate the mean, and a wavy line to
designate the standard deviation. In this notation, we may say that the
problem is to estimate $\bar{Y}_{\mathrm{B}}$ and $\tilde{Y}_{\mathrm{B}}$ (the mean and standard
deviation of Y for group B) from the known items of information,
$\bar{X}_{\mathrm{A}}$, $\tilde{X}_{\mathrm{A}}$, $\bar{X}_{\mathrm{B}}$, $\tilde{X}_{\mathrm{B}}$, $\bar{Y}_{\mathrm{A}}$, and $\tilde{Y}_{\mathrm{A}}$ (the mean and
standard deviation of Y for group A and of X for groups A and B).

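Making use of the equation of the regression line and the deviation
score notation, we may say that

$$y_i = a x_i + e_i. \tag{11}$$

The score of the $i$th person on test $y$ is equal to $a$ times his deviation
score on test $x$ plus an error, $e_i$. We may change to gross scores by
substituting $Y_i - \bar{Y}$ for $y_i$, $X_i - \bar{X}$ for $x_i$, and write equation 11
explicitly for each of the groups A and B, as follows:

$$Y_{\mathrm{A}i} = a_{\mathrm{A}} X_{\mathrm{A}i} + \bar{Y}_{\mathrm{A}} - a_{\mathrm{A}} \bar{X}_{\mathrm{A}} + e_{\mathrm{A}i} \tag{12}$$

and

$$Y_{\mathrm{B}i} = a_{\mathrm{B}} X_{\mathrm{B}i} + \bar{Y}_{\mathrm{B}} - a_{\mathrm{B}} \bar{X}_{\mathrm{B}} + e_{\mathrm{B}i}. \tag{13}$$
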
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Since complete information is available on group A, equation 12 pre- 
sents no problem. However, for group B only the X-scores are avail- 
able; hence some estimates must be made regarding the constants in 
equation 13. It seems reasonable to assume that the slope and intercept 
of the regression of Y on X for group B are equal, respectively, to the 
slope and intercept of the same regression for group A, that is, 


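$$a_{\mathrm{A}} = a_{\mathrm{B}} \tag{14}$$

and

$$\bar{Y}_{\mathrm{A}} - a_{\mathrm{A}} \bar{X}_{\mathrm{A}} = \bar{Y}_{\mathrm{B}} - a_{\mathrm{B}} \bar{X}_{\mathrm{B}}. \tag{15}$$

Summing equation 13 and dividing by $N_{\mathrm{B}}$ to obtain the mean of $Y_{\mathrm{B}i}$,
we have

$$\bar{Y}_{\mathrm{B}} = a_{\mathrm{B}} \bar{X}_{\mathrm{B}} + \bar{Y}_{\mathrm{B}} - a_{\mathrm{B}} \bar{X}_{\mathrm{B}} + \bar{e}_{\mathrm{B}}. \tag{16}$$

If we assume that

$$\bar{e}_{\mathrm{A}} = \bar{e}_{\mathrm{B}} = 0, \tag{17}$$

and substitute equations 14, 15, and 17 in equation 16, we obtain

$$\bar{Y}_{\mathrm{B}} = \bar{Y}_{\mathrm{A}} + a_{\mathrm{A}}(\bar{X}_{\mathrm{B}} - \bar{X}_{\mathrm{A}}). \tag{18}$$

Equation 18 expresses the Y-mean for group B in terms of known
quantities.

The value $\bar{Y}_{\mathrm{B}}$ given by equation 18 is the arbitrary mean to
be assigned to the B-group in order to have the scores
comparable with the Y-scores of the A-group. This is the value
to be used as $M_w$ in equation 6. Equation 18 is derived from
assumptions that are the same as those used in the equations
for group heterogeneity in Chapters 10 to 13.
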


To obtain the variance of $Y_{\mathrm{B}i}$, we write equation 13 in deviation score
form as in equation 11, and take the sum of the squares of the
deviations over $N_{\mathrm{B}}$, obtaining

$$\frac{\sum_{i=1}^{N} y_{\mathrm{B}i}^2}{N} = \frac{\sum_{i=1}^{N} (a_{\mathrm{B}} x_{\mathrm{B}i} + e_{\mathrm{B}i})^2}{N}. \tag{19}$$

Since the correlation between $x$ and $e$ is zero, $\sum xe$ is zero. Expanding
the right side of the equation, and writing $\tilde{Y}_{\mathrm{B}}^2$ for the variance of
$Y_{\mathrm{B}}$ and $\tilde{E}_{\mathrm{B}}^2$ for the error variance, we obtain

$$\tilde{Y}_{\mathrm{B}}^2 = \frac{a_{\mathrm{B}}^2 \sum_{i=1}^{N} x_{\mathrm{B}i}^2}{N} + \frac{\sum_{i=1}^{N} e_{\mathrm{B}i}^2}{N} \tag{20}$$

or

$$\tilde{Y}_{\mathrm{B}}^2 = a_{\mathrm{B}}^2 \tilde{X}_{\mathrm{B}}^2 + \tilde{E}_{\mathrm{B}}^2. \tag{21}$$

Likewise, for the A group we have

$$\tilde{Y}_{\mathrm{A}}^2 = a_{\mathrm{A}}^2 \tilde{X}_{\mathrm{A}}^2 + \tilde{E}_{\mathrm{A}}^2. \tag{22}$$

From equation 22 we can solve for $\tilde{E}_{\mathrm{A}}^2$ in terms of $\tilde{Y}_{\mathrm{A}}$ and $\tilde{X}_{\mathrm{A}}$. If we
make the assumption that

$$\tilde{E}_{\mathrm{A}} = \tilde{E}_{\mathrm{B}}, \tag{23}$$

we may write $\tilde{Y}_{\mathrm{B}}^2$ entirely in terms of known quantities as

$$\tilde{Y}_{\mathrm{B}}^2 = \tilde{Y}_{\mathrm{A}}^2 + a_{\mathrm{A}}^2(\tilde{X}_{\mathrm{B}}^2 - \tilde{X}_{\mathrm{A}}^2). \tag{24}$$

Equation 24 expresses the Y-variance for group B in terms of known
quantities.

The value $\tilde{Y}_{\mathrm{B}}$ given by equation 24 is the arbitrary standard
deviation to be assigned to the B-group in order to have the
scores comparable with the Y-scores of the A-group. This is
the value to be used as $s_w$ in equation 6. Equation 24 is
derived from assumptions that are the same as those used in
the equations for group heterogeneity in Chapters 10 to 13.

Equation 24 is identical with equation 20 of Chapter 11.
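
As a minimal sketch of how equations 18 and 24 might be applied to raw
data, the following computes the mean and standard deviation to be
assigned to group B from one equating test. The function names and the
small data sets are hypothetical; the slope is the ordinary
least-squares regression of Y on X in group A, and variances are
computed with N (not N - 1) in the denominator.

```python
from math import sqrt
from statistics import fmean, pvariance


def slope_y_on_x(xs, ys):
    """Least-squares slope of the regression of y on x."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    return sum_xy / sum_xx


def equate_single(x_a, y_a, x_b):
    """Estimated mean and standard deviation of Y for group B."""
    a = slope_y_on_x(x_a, y_a)
    # Equation 18: shift the A-group mean along the regression line.
    mean_y_b = fmean(y_a) + a * (fmean(x_b) - fmean(x_a))
    # Equation 24: adjust the A-group variance for the change in X-variance.
    var_y_b = pvariance(y_a) + a * a * (pvariance(x_b) - pvariance(x_a))
    return mean_y_b, sqrt(var_y_b)


# Hypothetical scores: group A took X and Y, group B took X only.
x_a = [1, 2, 3, 4, 5]
y_a = [2, 4, 5, 4, 5]
x_b = [2, 4, 6, 8, 10]
print(equate_single(x_a, y_a, x_b))
```

If group B is much less variable than group A on the equating test, the
estimated variance can become small or even negative with real data, in
which case the assumptions of equations 14, 15, and 23 deserve
re-examination.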

If K equating variables are available, the derivation of the Y mean 
and variance for group B follows the same general pattern, except that 
a multiple regression equation is used instead of the regression line of 
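equation 11. To correspond to equation 11 for the multiple-regression
case, we write

$$y_i = \sum_{g=1}^{K} a_g x_{ig} + e_i. \tag{25}$$

From equation 25, the equations corresponding to equations 12 and 13
are written as

$$Y_{\mathrm{A}i} = \sum_{g=1}^{K} a_{\mathrm{A}g} X_{\mathrm{A}ig} + \bar{Y}_{\mathrm{A}} - \sum_{g=1}^{K} a_{\mathrm{A}g} \bar{X}_{\mathrm{A}\cdot g} + e_{\mathrm{A}i} \tag{26}$$

and

$$Y_{\mathrm{B}i} = \sum_{g=1}^{K} a_{\mathrm{B}g} X_{\mathrm{B}ig} + \bar{Y}_{\mathrm{B}} - \sum_{g=1}^{K} a_{\mathrm{B}g} \bar{X}_{\mathrm{B}\cdot g} + e_{\mathrm{B}i}. \tag{27}$$

To obtain the mean $\bar{Y}_{\mathrm{B}}$, sum equation 27 and divide by $N_{\mathrm{B}}$, obtaining

$$\bar{Y}_{\mathrm{B}} = \sum_{g=1}^{K} a_{\mathrm{B}g} \bar{X}_{\mathrm{B}\cdot g} - \sum_{g=1}^{K} a_{\mathrm{B}g} \bar{X}_{\mathrm{B}\cdot g} + \bar{Y}_{\mathrm{B}} + \bar{e}_{\mathrm{B}}. \tag{28}$$

To correspond to equations 14 and 15, we assume that the regression
coefficients, $a_g$, for group B are equal to those for group A, and that the
constant term for group A is equal to that for group B. These
assumptions give the $K + 1$ equalities,

$$a_{\mathrm{A}g} = a_{\mathrm{B}g} \quad (g = 1 \cdots K) \tag{29}$$

and

$$\bar{Y}_{\mathrm{A}} - \sum_{g=1}^{K} a_{\mathrm{A}g} \bar{X}_{\mathrm{A}\cdot g} = \bar{Y}_{\mathrm{B}} - \sum_{g=1}^{K} a_{\mathrm{B}g} \bar{X}_{\mathrm{B}\cdot g}. \tag{30}$$

Substituting equations 17, 29, and 30 in equation 28 gives

$$\bar{Y}_{\mathrm{B}} = \bar{Y}_{\mathrm{A}} + \sum_{g=1}^{K} a_{\mathrm{A}g}(\bar{X}_{\mathrm{B}\cdot g} - \bar{X}_{\mathrm{A}\cdot g}). \tag{31}$$

For the general case of K equating variables, equation 31
gives the Y-mean for group B. This is the value to be used
for $M_w$ in equation 6. The derivation uses the same
assumptions as those of Chapters 10 to 13.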


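In order to obtain the Y-variance for group B, write equation 27 in
deviation score form as

$$Y_{\mathrm{B}i} - \bar{Y}_{\mathrm{B}} = \sum_{g=1}^{K} a_{\mathrm{B}g}(X_{\mathrm{B}ig} - \bar{X}_{\mathrm{B}\cdot g}) + e_{\mathrm{B}i}. \tag{32}$$

Using the lower-case symbols to designate deviation scores gives

$$y_{\mathrm{B}i} = \sum_{g=1}^{K} a_{\mathrm{B}g} x_{\mathrm{B}ig} + e_{\mathrm{B}i}. \tag{33}$$

To obtain $N$ times the variance of $y$, square both sides of equation 33
and sum. Noting that all terms of the form $\sum_{i=1}^{N} x_i e_i$ are zero, we may
write

$$\sum_{i=1}^{N} y_{\mathrm{B}i}^2 = \sum_{i=1}^{N} \left[ \sum_{g=1}^{K} a_{\mathrm{B}g} x_{\mathrm{B}ig} \right]^2 + \sum_{i=1}^{N} e_{\mathrm{B}i}^2. \tag{34}$$
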

The first term on the right side of the equation may be expressed as a
triple summation, and the order of summation may be altered, giving

$$\sum_{i=1}^{N} y_{\mathrm{B}i}^2 = \sum_{g=1}^{K} \sum_{h=1}^{K} a_{\mathrm{B}g} a_{\mathrm{B}h} \sum_{i=1}^{N} x_{\mathrm{B}ig} x_{\mathrm{B}ih} + \sum_{i=1}^{N} e_{\mathrm{B}i}^2. \tag{35}$$

Equation 35 may be simplified by the notation

$$\sum_{i=1}^{N} x_{ig} x_{ih} = N c_{gh}. \tag{36}$$

If $g \neq h$, $c$ is a covariance. If $g = h$, $c$ is a variance. For variables $y$
and $e$, the sum of squares will be designated by $N\tilde{Y}^2$ and $N\tilde{E}^2$,
respectively. Introducing these notational changes in equation 35 and
dividing by $N$ gives

$$\tilde{Y}_{\mathrm{B}}^2 = \sum_{g=1}^{K} \sum_{h=1}^{K} a_{\mathrm{B}g} a_{\mathrm{B}h} c_{\mathrm{B}gh} + \tilde{E}_{\mathrm{B}}^2. \tag{37}$$

Likewise, for group A, we have

$$\tilde{Y}_{\mathrm{A}}^2 = \sum_{g=1}^{K} \sum_{h=1}^{K} a_{\mathrm{A}g} a_{\mathrm{A}h} c_{\mathrm{A}gh} + \tilde{E}_{\mathrm{A}}^2, \tag{38}$$

from which the variance $\tilde{E}_{\mathrm{A}}^2$ may be written as

$$\tilde{E}_{\mathrm{A}}^2 = \tilde{Y}_{\mathrm{A}}^2 - \sum_{g=1}^{K} \sum_{h=1}^{K} a_{\mathrm{A}g} a_{\mathrm{A}h} c_{\mathrm{A}gh}. \tag{39}$$

Using the assumption of equation 23, we may substitute equation 39
in equation 37; then using the assumptions of equation 29 and
simplifying the result, we find the solution

$$\tilde{Y}_{\mathrm{B}}^2 = \tilde{Y}_{\mathrm{A}}^2 + \sum_{g=1}^{K} \sum_{h=1}^{K} a_{\mathrm{A}g} a_{\mathrm{A}h}(c_{\mathrm{B}gh} - c_{\mathrm{A}gh}), \tag{40}$$

where $\tilde{Y}_{\mathrm{B}}^2$ is the variance of variable Y for group B,
$\tilde{Y}_{\mathrm{A}}^2$ is the variance of variable Y for group A,
$a_{\mathrm{A}g}$ is the regression weight for variable $X_g$ in predicting Y in
group A, and
$c_{\mathrm{A}gh}$ (or $c_{\mathrm{B}gh}$) is the covariance $(1/N)\sum_{i=1}^{N} x_{ig} x_{ih}$ in group A
(or B).

For the general case of K equating variables, equation 40
gives the Y-variance for group B. This is the value to be
used for $s_w$ in equation 6. The derivation uses the same
assumptions as those of Chapters 10 to 13.

The problem discussed in this section is equating test Z, given to
group B, to test Y, given to group A, by means of an equating test X or
a set of $K$ equating tests $X_g$. The solution is to estimate the mean of Y
for group B by equations 18 or 31, and to estimate the variance of Y for
group B by equations 24 or 40. These estimated values are then used
as the arbitrary mean and variance ($M_w$ and $s_w^2$) to be assigned to the
new test Z (see equations 5 and 6). This equating procedure is
appropriate for any linear derived scores.
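
The general case can be sketched in the same way. In the following
illustration the regression weights for group A are taken as already
computed, and the means and the variance-covariance tables of the
equating variables are supplied for both groups; the function name, the
argument names, and the numbers are hypothetical.

```python
def equate_multi(weights, mean_y_a, var_y_a,
                 means_x_a, means_x_b, cov_x_a, cov_x_b):
    """Estimated mean and variance of Y for group B from K equating tests.

    weights   : regression weights for predicting Y in group A
    means_x_* : means of the K equating variables in each group
    cov_x_*   : K-by-K tables of variances and covariances in each group
    """
    k = len(weights)
    # Equation 31: adjust the A-group mean for each difference in X-means.
    mean_y_b = mean_y_a + sum(
        weights[g] * (means_x_b[g] - means_x_a[g]) for g in range(k)
    )
    # Equation 40: adjust the A-group variance for each difference in the
    # variances (g == h) and covariances (g != h) of the equating tests.
    var_y_b = var_y_a + sum(
        weights[g] * weights[h] * (cov_x_b[g][h] - cov_x_a[g][h])
        for g in range(k)
        for h in range(k)
    )
    return mean_y_b, var_y_b


# Hypothetical values for two equating variables.
print(equate_multi(
    weights=[0.5, 0.25],
    mean_y_a=50.0, var_y_a=100.0,
    means_x_a=[20.0, 30.0], means_x_b=[22.0, 34.0],
    cov_x_a=[[16.0, 4.0], [4.0, 36.0]],
    cov_x_b=[[25.0, 6.0], [6.0, 36.0]],
))
```

With a single equating variable the double sum has one term, and the
computation reduces to equations 18 and 24.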

For non-linear transformations of gross scores no appropriate proce- 
dure has yet been suggested. For percentile scores, it may well be 
impossible to develop a rigorous equating method. For normalized 
scores, it may be that some adaptation of Thurstone’s absolute scaling
methods will give a satisfactory solution. These normalized scores 
might then furnish a satisfactory basis for equating percentile scores. 
If and when a solution for the equating problem is developed for normal- 
ized and percentile scores, it is highly likely that the sampling errors in- 
volved will be very much greater than those found in equating on the 
basis of linear derived scores. If the magnitude of sampling errors
involved in equating tests from one group to another is considered, it
seems likely that linear derived scores have a distinct advantage over 
the non-linear transformations. 


16. Summary 


After the test papers have been scored, the next step is to assess the
gross score distribution in terms of the average chance score $K/A$, and
the variability of chance scores

$$s_c = \frac{1}{A}\sqrt{K(A - 1)}, \tag{1}$$

where $K$ is the number of items in the test and $A$ is the number of
alternatives per item. The lowest score taken as indicating knowledge of
the subject should be greater than $K/A + 2s_c$.

Whenever the distribution is divided into groups, the distance from
the lower to the upper bound of a group should be large with respect
to the error of measurement of the test.
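
The chance-score check can be sketched as follows. The sketch assumes
that the variability of chance scores is the binomial standard
deviation for K items each answered correctly by chance with
probability 1/A; the function name and the test length used are
illustrative only.

```python
from math import sqrt


def chance_floor(k_items, n_alternatives):
    """Mean and SD of chance scores, and the lowest meaningful score bound."""
    mean_chance = k_items / n_alternatives
    # Binomial SD with p = 1/A: sqrt(K * (1/A) * (1 - 1/A)).
    sd_chance = sqrt(k_items * (n_alternatives - 1)) / n_alternatives
    return mean_chance, sd_chance, mean_chance + 2 * sd_chance


# Hypothetical test of 100 five-choice items.
print(chance_floor(100, 5))
```

On such a test, scores at or below the third value returned would not
be taken as indicating knowledge of the subject.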


It should be noted that, whenever several tests are used, the principle
of successive hurdles makes for raising of passing standards, whereas
permitting multiple trials lowers standards, particularly with unreliable
tests.

In converting gross scores [...] grades, Civil Service ratings [...]
programs, the procedure is to convert gross scores to some predetermined
scale that indicates the relative
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standing of the individual in his group, and to report scores in terms of 
this scale. The transformations usually considered are: 


1. Linear transformations, termed standard scores, or linear derived 
scores. 

2. Non-linear transformations, of which the commonest are the 
percentile score, and the normalized score. 


Percentile scores represent the percentage of persons in a typical 
group scoring less than the person in question. Such scores are very 
easy to explain to persons unacquainted with testing, but they have so 
many disadvantages that percentile scores are not generally used except 
as auxiliary scores. Percentile differences or averages are not constant 
in meaning from the middle to the extreme of the scale, and the equating 
of different groups is difficult if not impossible. Normalized scores 
should in general not be used unless there is some good reason for be- 
lieving that the underlying distribution of ability is normal and is mis- 
represented by the distribution of gross scores. Thurstone’s Absolute 
Scaling Methods furnish one way of checking on this belief for two 
partly overlapping groups given the same test. The range of possible 
normalized scores is also sensitive to the number of cases in the group, 
and to the grouping of the extreme cases. This effect must be watched 
carefully if comparisons are to be made from group to group or from 
test to test. 

The various disadvantages of percentile and normalized scores has 
led to the general use of some linear transformation of gross scores, with 
a convenient scale specified by an arbitrary mean and standard devia- 
tion. For standard scores the computation equation is 


(4)  $$z_i = \frac{1}{s_x} X_i - \frac{M_x}{s_x},$$

where 

$z_i$ is the standard score of the $i$th individual, 

$X_i$ is the gross score of the same individual, and 

$M_x$ and $s_x$ are the mean and standard deviation of the gross score 
distribution. 

For other types of linear derived scores the computing equation is 

(6)  $$w_i = \frac{s_w}{s_x} X_i + M_w - \frac{s_w}{s_x} M_x,$$

where $w_i$ is the linear derived score of individual $i$, and 
$M_w$ and $s_w$ are the arbitrarily specified mean and standard deviation 
of the linear derived scores. 

The other terms have the definitions given for equation 4. 
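
As a rough illustration of the two linear transformations, the following sketch computes standard scores and linear derived scores for a small list of gross scores. The function names and the sample data are hypothetical, and the standard deviation is taken with N (not N - 1) in the denominator, which is an assumption about the convention intended.

```python
from statistics import fmean, pstdev

def standard_scores(gross):
    """z = X/s - M/s, using the mean and SD of the gross scores."""
    m, s = fmean(gross), pstdev(gross)
    return [x / s - m / s for x in gross]

def linear_derived_scores(gross, mean_w=50.0, sd_w=10.0):
    """w = (s_w/s_x) X + M_w - (s_w/s_x) M_x for an arbitrary mean and SD."""
    m, s = fmean(gross), pstdev(gross)
    slope = sd_w / s
    return [slope * x + mean_w - slope * m for x in gross]

# Hypothetical gross scores with mean 5 and standard deviation 2.
gross = [2, 4, 4, 4, 5, 5, 7, 9]
z = standard_scores(gross)
w = linear_derived_scores(gross)   # scale with mean 50, SD 10
```

A gross score two standard deviations above the mean receives z = 2 and, on the 50/10 scale, w = 70; the two scales differ only in origin and unit.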
The use of normalized scores referred to some standard group has 
been suggested by McCall (the T-score), by Flanagan (the Cooperative 
Test Scaled Scores), and by Thurstone (absolute scaling methods). 
Thurstone gives the fundamental scaling equation as 

(9)  $$M_{ya} + Y_{ia} s_{ya} = M_{yb} + Y_{ib} s_{yb}$$

or 

(10)  $$Y_{ia} = \left(\frac{s_{yb}}{s_{ya}}\right) Y_{ib} + \frac{M_{yb} - M_{ya}}{s_{ya}},$$

where $M_{ya}$ and $M_{yb}$ are the means in hypothetical absolute units for 
distributions a and b, 
$s_{ya}$ and $s_{yb}$ are the standard deviations in hypothetical 
absolute units for distributions a and b, and 
$Y_{ia}$ and $Y_{ib}$ are the normalized scores for distributions a and b, 
respectively. 


Equation 10 demonstrates that the normalized scores for two distri- 
butions will be linearly related to each other, if both distributions can 
be regarded as normal on the same scale. 

In standardizing to predict a criterion performance it is necessary to 
use the regression of criterion on test score and to give the corresponding 
error of estimate in order to use the norms properly. 
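
The regression of criterion on test score and its error of estimate can be combined to give the proportion of persons expected to exceed a critical criterion level at each test score. The sketch below is one way to do this, assuming criterion scores are normally distributed about the regression line with the same error of estimate at every test score; the function name and the numerical values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def proportion_above(x, mean_x, sd_x, mean_y, sd_y, r_xy, critical_y):
    """Proportion expected to exceed critical_y among persons with test score x."""
    predicted_y = mean_y + r_xy * (sd_y / sd_x) * (x - mean_x)   # regression of y on x
    error_of_estimate = sd_y * sqrt(1 - r_xy ** 2)
    return 1 - NormalDist(predicted_y, error_of_estimate).cdf(critical_y)

# Hypothetical norms: test mean 10, SD 3; criterion mean 9, SD 3; r = .6.
for x in (4, 7, 10, 13, 16):
    p = proportion_above(x, 10, 3, 9, 3, 0.6, critical_y=9)
    print(x, round(p, 3))
```

Plotting p against x gives the curve from which a cutting score can be read once the acceptable value of p has been decided.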

If, in addition to the regression of criterion on test score, we have a 
specified critical criterion level, it is possible from these two items of 
information to draw a curve showing the percentage of persons (p) 
above the critical level at each test score range. This graph can be 
used for determining the cutting score. If the ratio of H, the cost of 
selecting a potential failure, to G, the net gain due to selecting a 
successful person, can be determined or estimated, the cutting score can be 
fixed at the test score level, where p = R/(R + [...] 

Another type of standardization [...] the nature of the sampling [...] 
It is also important to [...]lation to the size of the [...] as these, 
the degree of [...] easily be markedly exaggerated or minimized. 

If two forms of a test to be equated have been given to the same 
group, it is possible to make two independent transformations to some 
linear or non-linear scores. Such scores for the two forms, however, 
will not be comparable unless the skewness and kurtosis coefficients for 
the gross score distributions are similar. Also if we are standardizing 
in terms of a criterion, two forms cannot be regarded as parallel unless 
the correlation with the criterion is approximately the same for the 
two forms. 

In standardizing two forms of a test, each given to a separate group, 
it is necessary to use some form of linear derived scores and to have a 
matching test. When linear derived scores are used, the only problem 
is to determine an appropriate mean and standard deviation for the 
second group. For a single equating variable, we have 


(18)  $$\bar{Y}_B = \bar{Y}_A + a_A(\bar{X}_B - \bar{X}_A)$$

and 

(24)  $$\tilde{Y}_B^2 = \tilde{Y}_A^2 + a_A^2(\tilde{X}_B^2 - \tilde{X}_A^2).$$

For K equating variables $X_g$ ($g = 1 \cdots K$), we have 

(31)  $$\bar{Y}_B = \bar{Y}_A + \sum_{g=1}^{K} a_{Ag}(\bar{X}_{Bg} - \bar{X}_{Ag})$$

and 

(40)  $$\tilde{Y}_B^2 = \tilde{Y}_A^2 + \sum_{g=1}^{K}\sum_{h=1}^{K} a_{Ag}a_{Ah}(c_{Bgh} - c_{Agh}),$$

where 

$\bar{Y}_A$ and $\tilde{Y}_A^2$ are the mean and variance of Y for group A. 
(These are the original scores to which the B-group scores are to be 
matched.) 

$\bar{X}_A$, $\tilde{X}_A^2$, $\bar{X}_B$, and $\tilde{X}_B^2$ are the mean 
and variance of X for groups A and B, respectively. (X is the matching 
test that has been given to both groups A and B. Also $X_g$, 
$g = 1 \cdots K$, indicates K matching tests.) 

$a_{Ag}$ is the regression weight for variable $X_g$ in predicting Y in 
group A. 

$c_{Agh}$ is $(1/N)\sum_{i=1}^{N} x_{Aig}x_{Aih}$ ($x_{ig} = X_{ig} - \bar{X}_g$). 
(If $g \neq h$, this term is a covariance; if $g = h$, this term is a 
variance.) 
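
For a single equating variable, equations 18 and 24 amount to a two-line computation. The sketch below is illustrative only: the function and argument names are hypothetical, and the numbers in the usage lines are made up rather than taken from any problem in this chapter.

```python
from math import sqrt

def equate_single(mean_y_a, var_y_a, slope_a, mean_x_a, var_x_a, mean_x_b, var_x_b):
    """Estimate the mean and SD that group B would have on test Y.

    slope_a is the regression weight of Y on the matching test X in group A.
    """
    mean_y_b = mean_y_a + slope_a * (mean_x_b - mean_x_a)      # equation 18
    var_y_b = var_y_a + slope_a ** 2 * (var_x_b - var_x_a)     # equation 24
    return mean_y_b, sqrt(var_y_b)

# Hypothetical values: group B is abler and more variable on the matching test.
mean_b, sd_b = equate_single(500.0, 100.0 ** 2, 12.0, 20.0, 36.0, 22.0, 49.0)
```

The resulting mean and standard deviation are the ones to assign to the second group when its own form is converted to linear derived scores.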
TABLE FOR USE IN CONNECTION WITH PROBLEMS 3 TO 7 

The following scores were made by 68,899 students in 323 colleges on the 1937 
edition of the American Council on Education Psychological Examination for 
College Freshmen. 

                           Frequency 
Scores             Men      Women    Total * 
0-9                2 4 
10-19                5          3         12 
20-29               27         22         58 
30-39               85         50        170 
40-49              169        112        329 
50-59              329        225        626 
60-69              471        358        943 
70-79              667        479      1,314 
80-89              923        769      1,915 
90-99            1,108        892      2,264 
100-109          1,387      1,171      2,897 
110-119          1,669      1,376      3,429 
120-129          1,768      1,529      3,764 
130-139          2,064      1,768      4,348 
140-149          2,113      1,793      4,471 
150-159          2,188      1,830      4,650 
160-169          2,220      1,748      4,600 
170-179          2,128      1,798      4,583 
180-189          1,990      1,610      4,207 
190-199          1,823      1,479      3,904 
200-209          1,639      1,351      3,593 
210-219          1,488      1,251      3,281 
220-229          1,234        996      2,686 
230-239          1,097        906      2,441 
240-249            893        748      2,025 
250-259            750        596      1,630 
260-269            584        488      1,309 
270-279            474        329        998 
280-289            358        273        772 
290-299            284        187        580 
300-309            184        122        387 
310-319            153         74        286 
320-329             96         52        181 
330-339             70         38        133 
340-349             29         13         51 
350-359            2 
360-369 
370-379            2 2 
380-389            1 5 
Total           32,500     26,450     68,899 
Lower quartile  127.27     127.54 
Median          165.75     164.84 
Upper quartile  207.57     206.10     208.87 
M = 170.0214 
s = 57.7012 

* The total includes the scores of 9949 students not classified according to sex. 

Data taken from L. L. Thurstone and T. G. Thurstone, The 1937 Psychological 
Examination for College Freshmen, The Educational Record, April, 1938. 

Problems 


1. Draw a graph corresponding to the following transformation. The distribu- 
tion of Table 1 is to be linearly transformed to a scale 00-99, such that a gross score 
of 150 equals 70 (the lowest passing mark) and a gross score of 210 equals 90 (the 
lowest honors mark). 


2. Give the average chance score ($\bar{c}$), the variance of scores due to chance ($s_c^2$), 
and the lowest gross score exceeding $\bar{c} + 2s_c$ for each of the following tests: 


(a) A 20-item true-false test. 

(b) A 100-item true-false test. 

(c) A 20-item multiple-choice test that has one correct and four incorrect alter- 
natives for each item. 

(d) A 100-item multiple-choice test, with 5 alternatives per item, as in c. 

(e) A 10-item test, with 10 alternatives for each item. 


3. Using only the total distribution given in the right-hand column of the 
foregoing table, compute the table, and draw the graph for transforming raw scores of the 
foregoing frequency distribution to (a) standard scores (z-scores), (b) linear derived 
scores, with a mean of 50 and a standard deviation of 10 (w-scores), (c) percentile 
scores (p-scores), (d) normalized scores (n-scores). 


4. From the information in the preceding problem, draw the graphs showing the 
relationship between (a) z-scores and w-scores, (b) z-scores and p-scores, (c) z-scores 
and n-scores, (d) w-scores and p-scores, (e) w-scores and n-scores, (f) p-scores and 
n-scores. 

Write a brief paragraph stating the relationships shown in the foregoing six graphs. 


5. As a check on normality use arithmetic probability paper and plot (a) p-scores 
against n-scores, (b) p-scores against w-scores. 


6. Using the distribution for men only as given in the foregoing frequency 
distribution, compute the table and draw the graph for transforming raw scores of men 
to (a) z-scores, (b) w-scores, with mean 50 and standard deviation 10, (c) p-scores, 
(d) n-scores. 


7. Using the distribution for women only as given in the foregoing frequency 
distribution, compute the table and draw the graph for transforming raw scores of 
women to (a) z-scores, (b) w-scores, with a mean of 50 and standard deviation 10, 
(c) p-scores, (d) n-scores. 

Write a brief paragraph comparing the norms for men with those for women. 


8. Below is given the frequency distribution of A.C.E. scores for 113 students 
taking the 56 tests used in Dr. Thurstone’s first large study of primary mental abili- 
ties. (Thurstone, 1938, page 19.) This distribution is given in terms of percentile 
points on the national norms for the A.C.E. test. The table shows that there was 
one student between the 35 and 40 percentile points on the national norms; two 
students between the 45 and 50 percentile points; and so forth. It will be noticed 
that over 25 per cent of the students are above the 98 percentile point on the national 
norms. Can this distribution of 113 cases be regarded as a normal distribution, 
granted the assumption of a Gaussian distribution of intelligence in the 40,000 
students on whom the national norms were based? Use the absolute scaling methods 
to answer this question. 


                              Cumulative 
Scores *       Frequency      Frequency 
35-40               1              1 
40-45               1              2 
45-50               2              4 
50-55               2              6 
55-60               4             10 
60-65               6             16 
65-70               2             18 
70-75               4             22 
75-80              10             32 
80-85               6             38 
85-90              10             48 
90-95              25             73 
95-96               3             76 
96-97               4             80 
97-98               3             83 
98-99               6             89 
99-99.9            23            112 
99.9-100.0          1            113 

* National norms (1933), 40,229 cases, 203 colleges. 
9. 
          Criterion (y)          Selection Test (x)               Minimum 
                  Standard               Standard                 Acceptable 
Set      Mean     Deviation     Mean     Deviation     r_xy       y-score 
A         9.11      2.95         9.78      3.12         .74         6.00 
B        30.42      9.25        20.48      6.75         .55        20.00 


For each of the foregoing sets of data, graph p, the percentage of acceptable 
y-scores, as a function of x, the selection test score. Determine the cutting score 
appropriate for each set of data on the assumption that: 

(a) The net gain due to selecting an acceptable student is equal to the loss incurred 
by selecting one who will fail, 

(b) The gain is double the loss, 

(c) The loss is double the gain. 


ard deviation of 100. In 1946, form 
1000 eighth-grade students. 
An equating test, form C, had been given to both groups of students when tests 
A and B were given, with the following results. For the second group, the mean 
was 23.4, and the standard deviation 6.8; for the first group, the mean was 21.3, the 
standard deviation 6.2, and the correlation r_AC was .75. It is desired to use linear 
derived scores for form B that will give scores directly comparable with the derived 
scores (mean = 500, standard deviation = 100), used for form A. 


(a) Write the transformation equation used on the 1942 group for form A. 
(b) Write the transformation equation that should be used with the 1946 group 
for form B. 


11. In 1944 a group of 1000 college freshmen were given two mathematics and one 
vocabulary test with the following results: 


                      Standard 
            Mean      Deviation     Correlations 
Math. A     137.6       15.8        r_AC = .81 
Math. C      48.1        7.3        r_AD = .08 
Voc. D      206.4       25.7        r_CD = .51 


In 1947 another group of 1000 college freshmen were given two mathematics tests 
and one vocabulary test. The vocabulary test and the mathematics test C were 
identical with the tests given to the 1944 group. For the 1947 group, the results 
were as follows: 


                      Standard 
            Mean      Deviation     Correlations 
Math. B     172.7       21.4        r_BC = .85 
Math. C      53.6        8.5        r_BD = .61 
Voc. D      213.2       28.9        r_CD = .49 


Test A has been converted to a linear derived score with a mean of 100 and a 
standard deviation of 20. 

(a) In order to make scores on test B comparable with the linear derived scores 
for test A, what arbitrary mean and standard deviation should be used for 
transforming test B? Use both tests C and D for equating. 

(b) Write the transformation for test A. 

(c) Write the appropriate computing equation to use for transforming scores on 
test B. 


12. One of the equations presented in connection with the discussion of the influence 
of group heterogeneity (see Chapters 10, 11, and 12) is analogous to one of the 
equations in this chapter. Find and compare these two equations. 


20 


Problems of Weighting 
and Differential Prediction 


1. General considerations in determining weights 


When several test scores are available, on the basis of which a decision 
is to be made, we have the problem of the appropriate method of com- 
bining these scores. When a single total score is to be derived from a 
number of measures, this score should represent the standing of the 
candidates with respect to something. The type of judgment involved 
in determining what this something should be and various methods of 
combining scores will be considered in this chapter. 

It should be noted that it is not possible to dodge the weighting 
problem if any decisions are to be made. Occasionally we hear the 
suggestion that scores simply be added together without bothering 
about problems of weighting. No matter what scores we add, the weight- 
ing problem is not avoided. Adding the gross scores on a series of tests 
gives relative weights of one sort, adding standard scores gives relative 
weights of a different type. What information must be obtained, and 
what major questions must be answered in order to secure reasonable 
composite scores from pooling the components? 

It has also been suggested that a separate cutting score may be 
determined for each test so that we should use a combination of cutting scores 
instead of a weighted score. Franzen (1943) has presented a type of 
"multiple chi" procedure for determining the best combination of cutting 
scores. This procedure consists essentially in trying all possible 
combinations of various cutting scores, and then using the one that turns 
out to be best for the set of data in hand. Systematic short-cut 
computational procedures are also presented by Franzen. 

In connection with multiple-cutting scores, it must be noted that 
policy changes that seem slight may in effect produce a marked difference 
in standards. In Figure 1 we see that, if a person must pass both tests 
to be accepted, only group 2 will be accepted. If passing either test is 
acceptable, groups 1, 2, and 3 will be accepted. It should also be noted 
that, if those who fail the first time are allowed to try a second and a 
third time, this policy is equivalent to saying that the person is 
acceptable if he passes either test; hence many more will pass. 


[Figure: scatter diagram with Test X on the horizontal axis and Test Y on the 
vertical axis, each divided by a cutting score.] 

Figure 1. Illustrating the difference between passing both and passing either test. 
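
The effect of the two policies, and of the weighted composite taken up in the paragraphs that follow, can be illustrated with a few made-up candidates. Everything in this sketch (function names, cutting scores, weights, and the candidates' scores) is hypothetical and chosen only to make the contrast visible.

```python
def passes_both(x, y, cut_x, cut_y):
    """Accept only if both tests are passed."""
    return x >= cut_x and y >= cut_y

def passes_either(x, y, cut_x, cut_y):
    """Accept if at least one test is passed."""
    return x >= cut_x or y >= cut_y

def passes_composite(x, y, weight_x, weight_y, cut):
    """Accept if the weighted sum of the two scores reaches a single cut."""
    return weight_x * x + weight_y * y >= cut

# Hypothetical (Test X, Test Y) scores.
candidates = [(70, 40), (55, 55), (40, 70), (90, 5), (30, 30)]
both = [c for c in candidates if passes_both(*c, 50, 50)]
either = [c for c in candidates if passes_either(*c, 50, 50)]
composite = [c for c in candidates if passes_composite(*c, 1, 1, 100)]
```

With these numbers the "both" rule accepts one candidate, the "either" rule accepts four, and the composite accepts three: it admits moderately unbalanced profiles but rejects the candidate who passes one test while nearly failing the other outright.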


The difference between multiple-cutting scores and a weighted 
composite may be seen by referring to Figure 2. In this figure the cutting 
scores are adjusted so that the same number will be passed by each 
policy. If both tests must be passed, we accept those above and to the 
right of line abc. If passing either test is acceptable, the person must be 
above or to the right of line def. In the first case, persons in areas 1, 4, 
and 5 are accepted; those in areas 2, 3, 6, 7, and 8 are rejected. In the 
second case those in areas 1, 2, 3, 6, and 7 are accepted; those in areas 
4, 5, and 8 are rejected. If the number of persons in area 4, plus the 
number in area 5, is equal to the total number in areas 2, 3, 6, and 7, 
the same number will be accepted by either system. Likewise, the 
number rejected will be the same for either system. The use of a 
weighted score is illustrated by the line gh. In using this line we accept 
those in areas 1, 2, 4, and 6, while rejecting those in areas 3, 5, 7, and 8. 
By all three methods everyone in area 1 is accepted, and everyone in 
area 8 is rejected. The methods differ only in the disposition of persons 
in the areas 2, 3, 4, 5, 6, and 7. It may be 
noted that the use of multiple-cutting scores results in putting areas 
4 and 5 (with difference scores near zero) together as either accepted 
or rejected. Also persons in areas 2, 3, 6, and 7 (with large positive 
or negative difference scores) are classed together as rejected or accepted. 
Thus the use of multiple-cutting scores would be justified by a 
curvilinear relationship between the criterion and the difference score. 


[Figure: scatter diagram with Test X on the horizontal axis and Test Y on the 
vertical axis.] 

Figure 2. Illustrating the difference between multiple cutting scores and a weighted 
composite. 


If this relationship is linear rather than curvilinear, the use of a straight 
line such as gh would be more appropriate than a multiple-cutting score. 
Since in general linear relationships have been found adequate for most 
test work, we shall limit [...] to a discussion of [...] relationship 
between two sets [...] covariance or correlation. It is im[...] 
sets of weights being considered are 
similar to each other, the two weighted composites will correlate highly 
with each other. For example, if tests A, B, and C receive weights 
1, 2, and 3, respectively, or 1, 2, and 4, respectively, the resulting com- 
posite scores will be very similar. The weights 3, 4, and 5 will give 
essentially the same results as 2, 4, and 6. Unless we are considering 
radically different sets of weights, the resulting scores cannot be altered 
much by changing from one set of weights to the other. If the two sets 
of weights have a low intercorrelation, the correlation between the 
composites will be determined by the ratio of the standard deviation 
of the distribution of the weights to the mean of the distribution, and 
also by the properties of the test battery, such as the number of tests 
combined and the correlation between these tests. 

It will be found that, if the standard deviation of the set of weights is 
very large in comparison to the mean, changes in weights used can 
produce great changes in scores regardless of the number of variables 
to be combined, and regardless of their intercorrelation. For example, 
if both positive and negative weights are permitted, the mean of the 
distribution of weights will be near zero, while the standard deviation 
will be very large. If freedom of this type is allowed in weighting, two 
composites may have very low correlations regardless of the number of 
variables combined and regardless of the intercorrelation of these 
variables. However, if the mean of the distribution of weights is about 
equal to or larger than the standard deviation of the distribution of 
weights and if the correlation between the two sets of weights is low, 
the correlation between the two composites will depend largely upon the 
number of variables involved and upon the intercorrelations of these 
variables. This case is important, since it is the usual one found in the 
weighting of items to give a total test score, and of tests in an aptitude 
battery to give a composite score. Limiting our consideration to sets of 
positive weights with low intercorrelations, we find that the composites 
will not be different unless there are relatively few variables to be 
combined and a low correlation among these variables. 

Thus we have seen that in considering the effect of weighting on a 
composite the test battery may be characterized by two variables: 
(1) the average intercorrelation between the tests and (2) the number 
of tests. The weights likewise may be characterized by two variables: 
(1) the ratio of the standard deviation to the mean of the distribution 
of weights and (2) the correlation between the two sets of weights. In 
order to demonstrate the effect of each of these factors on the correlation 
between the two composites, it is necessary to write an expression for 
the correlation between two weighted sums. 
Let us consider a set of standard scores z and two different sets of 
weights designated by V and W. We have 

(1)  $$X_{Vi} = V_1 z_{1i} + V_2 z_{2i} + \cdots + V_K z_{Ki} = \sum_{g=1}^{K} V_g z_{gi}.$$

The composite score ($X_V$) for individual $i$ is equal to the sum of the 
products of z-scores for that individual, each multiplied by the assigned 
weight ($V_g$). In like manner we may write another composite $X_W$, 
obtained by applying a different set of weights ($W_g$) to the same set of 
standard scores, namely, 

(2)  $$X_{Wi} = W_1 z_{1i} + W_2 z_{2i} + \cdots + W_K z_{Ki} = \sum_{g=1}^{K} W_g z_{gi}.$$

The composite score ($X_W$) for individual $i$ is the sum from $g = 1$ to $K$ 
of the products of the z-scores for that individual, each multiplied by 
the assigned weight ($W_g$). In order to indicate the influence of two 
different sets of weights, we shall write the correlation between $X_V$ 
and $X_W$, 

(3)  $$R_{X_V X_W} = \frac{\sum_{i=1}^{N} X_{Vi} X_{Wi}}{\sqrt{\sum_{i=1}^{N} X_{Vi}^2 \, \sum_{i=1}^{N} X_{Wi}^2}}.$$

It should be noted that, since the z-scores have a zero mean, the X-scores 
will also have a zero mean; hence the gross score formula for correlation 
need not be used. 


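Substituting equations 1 and 2 in the numerator of equation 3 and 
expanding, we have 

(4)  $$\begin{aligned}
\sum X_V X_W = {} & V_1 W_1 \sum z_1^2 + V_2 W_1 \sum z_1 z_2 + \cdots + V_K W_1 \sum z_1 z_K \\
& + V_1 W_2 \sum z_1 z_2 + V_2 W_2 \sum z_2^2 + \cdots + V_K W_2 \sum z_2 z_K \\
& + \cdots \\
& + V_1 W_K \sum z_1 z_K + V_2 W_K \sum z_2 z_K + \cdots + V_K W_K \sum z_K^2,
\end{aligned}$$

where it is understood that all summations are over individuals 
($i = 1 \cdots N$). Since the z's are standard scores, $\sum z_g^2 = N$ and 
$\sum z_g z_h = N r_{gh}$ ($g \neq h$). If we make these substitutions, indicating first 
the sum of the K diagonal terms and then the $K^2 - K$ non-diagonal 
terms, we have 

(5)  $$\sum_{i=1}^{N} X_{Vi} X_{Wi} = \sum_{g=1}^{K} V_g W_g N + \sum_{g \neq h = 1}^{K} V_g W_h r_{gh} N.$$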
It must be noted that there are K terms in the first summation and 
$K^2 - K$ terms in the second summation. By substituting V for W 
in equation 5, we may write the first factor in the denominator of 
equation 3 as follows: 

(6)  $$\sum_{i=1}^{N} X_{Vi}^2 = \sum_{g=1}^{K} V_g^2 N + \sum_{g \neq h = 1}^{K} V_g V_h r_{gh} N.$$

By substituting W for V in equation 6, we may obtain an expression 
for the second factor in the denominator of equation 3. Substituting 
equations 5 and 6 in equation 3 and factoring N out of both numerator 
and denominator, we have 

(7)  $$R_{X_V X_W} = \frac{\sum_{g=1}^{K} V_g W_g + \sum_{g \neq h = 1}^{K} V_g W_h r_{gh}}{\sqrt{\left[\sum_{g=1}^{K} V_g^2 + \sum_{g \neq h = 1}^{K} V_g V_h r_{gh}\right]\left[\sum_{g=1}^{K} W_g^2 + \sum_{g \neq h = 1}^{K} W_g W_h r_{gh}\right]}}.$$
Again it must be remembered that the single-subscript summations 
contain K terms, whereas the summations involving both g and h 
contain the $K^2 - K$ non-diagonal terms. Let us now consider the 
numerator term of equation 7. We may use $C_{VW}$ to designate the covariance 
between the two sets of weights and write 

(8)  $$C_{VW} = \left(\frac{1}{K}\right) \sum_{g=1}^{K} V_g W_g - \bar{V}\bar{W},$$

where $\bar{V}$ is the mean of the V's and $\bar{W}$ is the mean of the W's. Solving 
equation 8 for $\sum V_g W_g$, we have 

(9)  $$\sum_{g=1}^{K} V_g W_g = K(C_{VW} + \bar{V}\bar{W}).$$

We may also introduce the concept of the covariance of $r_{gh}$ with the 
product $V_g W_h$, designated by $C_{(VW)r}$. This term will in general have a 
lower bound of zero and an upper bound equal to the product of the 
standard deviation of r and the standard deviation of VW. We are 
limiting ourselves here to the conventional case in which highly 
intercorrelated parts are given more weight than those with low 
intercorrelations. 

Following the form of equation 9, we may write 

(10)  $$\sum_{g \neq h = 1}^{K} V_g W_h r_{gh} = (K^2 - K)\left[C_{(VW)r} + (\overline{VW})\bar{r}\right].$$
Referring to equation 4, we see that the term $(\overline{VW})$ is the mean of only 
the non-diagonal product terms of the form $V_g W_h$. Summing the VW 
terms by columns gives 

$$V_1 \sum W + V_2 \sum W + \cdots + V_K \sum W = \left(\sum V\right)\left(\sum W\right) = K\bar{V}K\bar{W}.$$

To obtain $(\overline{VW})$ we deduct the sum of the terms in the principal diagonal 
and divide by $K^2 - K$, obtaining 

(11)  $$(\overline{VW}) = \frac{K\bar{V}K\bar{W} - \sum V_g W_g}{K^2 - K}.$$

Substituting equation 9 in equation 11, and then the rewritten equation 
11 in equation 10 and rearranging terms, we have 

(12)  $$\sum_{g \neq h = 1}^{K} V_g W_h r_{gh} = (K^2 - K)\left(C_{(VW)r} + \bar{V}\bar{W}\bar{r}\right) - K C_{VW}\bar{r}.$$


Combining equations 9 and 12, we may write the numerator of equation 
7 as follows: 

(13)  $$\sum_{g=1}^{K} V_g W_g + \sum_{g \neq h = 1}^{K} V_g W_h r_{gh} = K(1 - \bar{r})(C_{VW} + \bar{V}\bar{W}) + (K^2 - K)C_{(VW)r} + K^2 \bar{r}\bar{V}\bar{W},$$

where $\bar{r}$ is the average intercorrelation of the subtests, 
$\bar{V}$ is the average of the V-weights, 
$\bar{W}$ is the average of the W-weights, 
K is the number of scores to be combined, 
$C_{VW}$ is the covariance between the two sets of weights, and 
$C_{(VW)r}$ is the covariance between $r_{gh}$ and the product $V_g W_h$. 

By substituting V for W in equation 13, we may write the first factor 
in the denominator of equation 7 as follows: 

(14)  $$\sum_{g=1}^{K} V_g^2 + \sum_{g \neq h = 1}^{K} V_g V_h r_{gh} = K(1 - \bar{r})(\tilde{V}^2 + \bar{V}^2) + (K^2 - K)C_{(VV)r} + K^2 \bar{V}^2 \bar{r}.$$

The covariance term $C_{VW}$, it should be noted, changes into the variance 
of V, which has been designated by the symbol $\tilde{V}^2$. By substituting 
W for V in equation 14, an expression can be written for the second term 
in the denominator of equation 7. Substituting equations 13 and 14 
in equation 7, we obtain the final expression for the correlation between 
the two weighted sums, 
(15)  $$R_{X_V X_W} = \frac{K(1 - \bar{r})(C_{VW} + \bar{V}\bar{W}) + (K^2 - K)C_{(VW)r} + K^2 \bar{r}\bar{V}\bar{W}}{\sqrt{\left[K(1 - \bar{r})(\tilde{V}^2 + \bar{V}^2) + (K^2 - K)C_{(VV)r} + K^2 \bar{r}\bar{V}^2\right]\left[K(1 - \bar{r})(\tilde{W}^2 + \bar{W}^2) + (K^2 - K)C_{(WW)r} + K^2 \bar{r}\bar{W}^2\right]}}.$$

This equation expresses the correlation between two composites obtained 
by using different weights, in terms of 

K, the number of scores to be combined, 

$\bar{W}$ and $\bar{V}$, the averages of the W and V weights, 

$\tilde{W}$ and $\tilde{V}$, the standard deviations of the two sets of weights, 

$\bar{r}$, the average intercorrelation of the scores to be combined, 

$C_{VW}$, the covariance between the two sets of weights used, and three 
terms of the form 

$C_{(\;)r}$, the covariance of a product of weights with $r_{gh}$. 


Horst (1941), pages 379-401, contains a discussion by M. W. Richardson 
of the principles to be followed and the precautions to be observed 
in deciding upon a set of weights. Richardson presents an equation 
analogous to equation 15, but derived from more restrictive assumptions. 

To see what happens as K increases, we may divide the numerator 
and denominator by $K^2$ and omit all terms which have (1/K) as a factor. 
This gives 

(16)  $$R_{X_V X_W} \rightarrow \frac{C_{(VW)r} + \bar{V}\bar{W}\bar{r}}{\sqrt{\left[C_{(VV)r} + \bar{V}^2\bar{r}\right]\left[C_{(WW)r} + \bar{W}^2\bar{r}\right]}},$$

which is equal to unity if the covariance terms are equal and the mean 
V-weight equals the mean W-weight. In particular, if the covariance 
terms are near zero they may be ignored; and in this case $R_{X_V X_W}$ 
approaches unity regardless of the value of the mean weights. 

Also we learn from equation 15 that, if the average intercorrelation 
of the items ($\bar{r}$) is near unity, the factor $(1 - \bar{r})$ approaches zero, and 
equation 15 approaches an expression similar to that given in equation 
16. That is, when $\bar{r}$ in equation 15 approaches unity, $R_{X_V X_W}$ approaches 
unity if the covariance terms are equal and if $\bar{V} = \bar{W}$. It is also true 
in this case that $R_{X_V X_W}$ approaches unity if $\bar{r}$ approaches unity and the 
covariance terms approach zero, regardless of the values of $\bar{V}$ and $\bar{W}$. 

It should also be noted that, if positive, zero, and negative weights 
in any combination are allowed, either or both $\bar{V}$ and $\bar{W}$ may be zero, 
and $\tilde{V}$ and $\tilde{W}$ may be either large or small in relation to finite values 
of $\bar{V}$ and $\bar{W}$. If such freedom in selection of weights is allowed, we can 
learn from equation 15 and from the limit given in equation 16 that 
$R_{X_V X_W}$ may assume any value, regardless of the intercorrelation of the 
weights, the number of variables, or the average intercorrelation of 
these variables. 

If all the weights cannot be positive, we are considering the situation 
in which $\bar{V}$ and $\bar{W}$ are small in relation to $\tilde{V}$ and $\tilde{W}$. Assuming that the 
terms containing $\bar{V}$ or $\bar{W}$ may be ignored as being small, we see that 
$R_{X_V X_W}$ depends primarily on the four covariance terms $C_{VW}$, $C_{(VW)r}$, 
$C_{(VV)r}$, and $C_{(WW)r}$. The number of variables to be combined (K) and 
the average intercorrelation of these variables ($\bar{r}$) are of only minor 
importance in determining $R_{X_V X_W}$ when $\bar{V}$ and $\bar{W}$ are both small. 

Let us examine equation 15 to see what happens as the correlation 
between the two sets of weights increases. Under this condition, the 
term $C_{VW}$ approaches $\tilde{V}\tilde{W}$, and the covariance $C_{(VW)r}$ becomes similar 
to $C_{(VV)r}$ and $C_{(WW)r}$, so that, as $r_{VW}$ approaches unity, the value of 
$R_{X_V X_W}$ is dependent upon the value of $\bar{V}$ and $\bar{W}$. Thus we see that 
as $r_{VW}$ approaches unity, $R_{X_V X_W}$ also approaches unity, provided that 
[...] 

We may also see that, as the standard deviations of V and of W are 
decreased, the covariance and variance terms in equation 15 decrease. 
In the limit these terms will vanish. Dividing the numerator and the 
denominator of the remaining terms by $\bar{V}\bar{W}$, we find that $R_{X_V X_W}$ 
approaches unity as the variance of the weights is decreased, provided 
that the terms $\bar{V}$ and $\bar{W}$ do not approach zero. 


Summarizing the information furnished by equation 15 and 
the limit given in equation 16, we see that: 

A. If either or both $\bar{V}$ and $\bar{W}$ may be zero, $R_{X_V X_W}$ may 
assume any value regardless of the value of $\bar{r}$, K, or the 
various covariance terms involving the weights. 

B. If $\bar{V}$ and $\bar{W}$ are small in relation to $\tilde{V}$ and $\tilde{W}$, $R_{X_V X_W}$ 
depends primarily on the four covariance terms $C_{VW}$, 
$C_{(VW)r}$, $C_{(VV)r}$, and $C_{(WW)r}$, and is relatively insensitive 
to changes in the values of $\bar{r}$ and K. 

[...] to be combined is increased. It should be particularly 
noted that this last effect holds, even if the correlation 
between the two sets of weights ($r_{VW}$) is zero, but that $\bar{r}$ must 
be greater than zero. (d) As the standard deviation of the 
weights ($\tilde{V}$ and $\tilde{W}$) is decreased in proportion to the mean 
weights ($\bar{V}$ and $\bar{W}$), $R_{X_V X_W}$ approaches unity regardless of 
the values of $\bar{r}$, V, and W. 


From the practical point of view, we may say that, if a large number 
of scores are to be combined (of the order of 50 to 100) or if the scores 
have high intercorrelations, it makes relatively little difference what sets 
of positive weights are assigned. The computationally simplest set 
would probably be the best one to use. If, however, we are combining 
only a few scores (for example, three to ten), and the average intercorre- 
lation is low (.5 or less), differential weighting equations may profitably 
be considered. However, the set of weights must have a large standard 
deviation if it is to give results appreciably different from the set 1, 
1, ..., 1; also if two sets of weights have a high intercorrelation it makes 
little or no difference which set is used. 
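
The practical conclusion can be checked numerically from equation 15. The sketch below evaluates that equation from summary statistics of the weights; the function and argument names are hypothetical, the three covariances of weight products with the intercorrelations default to zero, and the illustrative values (mean weight 3, standard deviation 1.5, uncorrelated sets of weights, average intercorrelation .30) are assumptions chosen only for the demonstration.

```python
from math import sqrt

def composite_correlation_eq15(k, r_bar, mean_v, mean_w, sd_v, sd_w, cov_vw,
                               cov_vw_r=0.0, cov_vv_r=0.0, cov_ww_r=0.0):
    """Correlation between two weighted composites, following equation 15."""
    def term(second_moment_part, mean_product, cov_with_r):
        # K(1 - r)(covariance or variance + product of means)
        #   + (K^2 - K)(covariance with r) + K^2 r (product of means)
        return (k * (1 - r_bar) * (second_moment_part + mean_product)
                + (k * k - k) * cov_with_r
                + k * k * r_bar * mean_product)

    numerator = term(cov_vw, mean_v * mean_w, cov_vw_r)
    factor_v = term(sd_v ** 2, mean_v ** 2, cov_vv_r)
    factor_w = term(sd_w ** 2, mean_w ** 2, cov_ww_r)
    return numerator / sqrt(factor_v * factor_w)

# Two uncorrelated sets of positive weights, average intercorrelation .30.
few = composite_correlation_eq15(5, 0.3, 3.0, 3.0, 1.5, 1.5, cov_vw=0.0)
many = composite_correlation_eq15(100, 0.3, 3.0, 3.0, 1.5, 1.5, cov_vw=0.0)
```

Under these assumptions the two composites correlate about .93 when five scores are combined and above .99 when one hundred are combined, even though the two sets of weights are themselves uncorrelated.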

Wilks (1938) has dealt with the special case in which the weights are 
distributed so that $\tilde{V}/\bar{V} = 1$ and $\tilde{W}/\bar{W} = 1$, and the weights are 
independent of each other and of the correlations between the variables to 
be weighted. It should be noted that the usual practice in weighting 
items in a test is to use positive weights so distributed that the standard 
deviation will not be large relative to the mean weight. Furthermore, 
in considering alternative sets of weights, the two sets considered are 
usually positively correlated so that the case dealing with weights 
independent of each other, which is considered by Wilks, will give composites 
that correlate less than the alternative composites usually considered in 
practice. It is also important to remember that, if we are willing to 
consider two sets of positive weights that are negatively correlated with 
each other, the correlation between the resulting composites will be 
lower than that indicated by Wilks' formula, given as equation 47 in 
this chapter. This formula may be derived in the following manner. 

If a sample of K weights ($V_g$) are drawn from an infinite population 
with mean $a_v$ and standard deviation $\sigma_v$, the variable 

$$\frac{(\bar{V} - a_v)\sqrt{K}}{\sigma_v}$$

has a zero mean and unit variance regardless of the magnitude of K 
(the number in the sample). (See Wilks, 1943, page 81.) The mean 
of a given sample may thus be expressed as 

$$\bar{V} = a_v + \frac{e_v \sigma_v}{\sqrt{K}}.$$

The sample mean is thus expressed as a function of the population mean, 
and standard deviation, the number of cases in the sample, and a 
variable ($e_v$) with zero mean and unit standard deviation. The standard 
deviation is thus independent of the number of cases in the sample. 
Since we have limited the treatment to the case in which the weights 
V are independent of W, the mean of the distribution of products VW 
for the entire population sampled will be equal to the product of the 
means of the two populations ($a_v a_w$). Hence we may write 

(17)  $$\frac{\sum_{g=1}^{K} V_g W_g}{K} = a_v a_w + \frac{e_1}{\sqrt{K}},$$

where $e_1$ is a random variable with zero mean and a constant standard 
deviation independent of K. 
Likewise the mean of the distribution of products $V_g W_h r_{gh}$ for the
entire population of weights and correlations will be equal to the product
of the means ($a_v a_w \bar r$). Thus we have

(18) $\dfrac{\sum_{g \neq h = 1}^{K} V_g W_h r_{gh}}{K^2 - K} = a_v a_w \bar r + \dfrac{e_2}{\sqrt{K^2 - K}}$,

where $e_2$ is a random variable with zero mean and a standard deviation
independent of K.

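Similarly, if we use $b_v$ to designate the second moment about the origin
for the population of weights V,

(19) $\dfrac{\sum_{g=1}^{K} V_g^2}{K} = b_v + \dfrac{e_3}{\sqrt{K}}$.

Correspondingly, the mean of the product $V_g V_h r_{gh}$ may be written

(20) $\dfrac{\sum_{g \neq h = 1}^{K} V_g V_h r_{gh}}{K^2 - K} = a_v^2 \bar r + \dfrac{e_4}{\sqrt{K^2 - K}}$.

Substituting W for V in the two foregoing equations, we obtain appropriate
expressions for the W-weights, with error terms $e_5$ and $e_6$.

Substituting equations 17, 18, 19, 20, and analogous equations for
the weights W in equation 7, we have

(21) $R_{x_v x_w} = \dfrac{K\left[a_v a_w + \dfrac{e_1}{\sqrt{K}}\right] + (K^2 - K)\left[a_v a_w \bar r + \dfrac{e_2}{\sqrt{K^2 - K}}\right]}{\sqrt{K\left[b_v + \dfrac{e_3}{\sqrt{K}}\right] + (K^2 - K)\left[a_v^2 \bar r + \dfrac{e_4}{\sqrt{K^2 - K}}\right]}\;\sqrt{K\left[b_w + \dfrac{e_5}{\sqrt{K}}\right] + (K^2 - K)\left[a_w^2 \bar r + \dfrac{e_6}{\sqrt{K^2 - K}}\right]}}$.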
In order to determine the factors influencing the composite as K becomes
large, we shall define a new variable,

$$y = \frac{1}{\sqrt{K}}.$$

R may then be regarded as a function of y and expanded in Taylor's
series about y = 0. If we set $\sqrt{K} = 1/y$ in equation 21, multiply
numerator and denominator by $y^4$, and define three new functions,
G(y), H(y), and F(y), we obtain

(22) $R_{x_v x_w} = G(y) = \dfrac{H(y)}{\sqrt{F_v(y)\,F_w(y)}}$,

where

(23) $H(y) = a_v a_w y^2 + e_1 y^3 + a_v a_w \bar r (1 - y^2) + e_2 y^2 \sqrt{1 - y^2}$,

(24) $F_v(y) = b_v y^2 + e_3 y^3 + a_v^2 \bar r (1 - y^2) + e_4 y^2 \sqrt{1 - y^2}$,

and

(25) $F_w(y) = b_w y^2 + e_5 y^3 + a_w^2 \bar r (1 - y^2) + e_6 y^2 \sqrt{1 - y^2}$.

We are now regarding $R_{x_v x_w}$ as a function of the variable y, and the
parameters a, b, and $\bar r$. The problem is to evaluate this function in the
vicinity of y = 0. Using Taylor's theorem for small values of y, we
may write

(26) $G(y) = G(0) + y\,G'(0) + \dfrac{y^2}{2}\,G''(0)$.

Setting y = 0 in equation 22, we see that

(27) $G(0) = 1$.


The problem is now to evaluate G'(0) and G''(0).
To evaluate G'(0), we differentiate equation 22, obtaining

(28) $G'(y) = \dfrac{H'(y)}{\sqrt{F_v(y)\,F_w(y)}} - \dfrac{G(y)}{2}\left[\dfrac{F_v'(y)}{F_v(y)} + \dfrac{F_w'(y)}{F_w(y)}\right]$.

To determine the value of equation 28, when y = 0, we need the derivatives
of equations 23, 24, and 25. These derivatives are

(29) $H'(y) = 2a_v a_w y + 3e_1 y^2 - 2a_v a_w \bar r y - e_2 \dfrac{3y^3 - 2y}{\sqrt{1 - y^2}}$,

(30) $F_v'(y) = 2b_v y + 3e_3 y^2 - 2a_v^2 \bar r y - e_4 \dfrac{3y^3 - 2y}{\sqrt{1 - y^2}}$,

and

(31) $F_w'(y) = 2b_w y + 3e_5 y^2 - 2a_w^2 \bar r y - e_6 \dfrac{3y^3 - 2y}{\sqrt{1 - y^2}}$.

Setting y = 0 in equations 28 to 31 inclusive, we find that

(32) $F_v'(0) = F_w'(0) = H'(0) = G'(0) = 0$.


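To evaluate G''(y), when y = 0, we differentiate equation 28, obtaining

(33) $G''(y) = \dfrac{H''(y)}{\sqrt{F_v(y)F_w(y)}} - \dfrac{H'(y)}{2\sqrt{F_v(y)F_w(y)}}\left[\dfrac{F_v'(y)}{F_v(y)} + \dfrac{F_w'(y)}{F_w(y)}\right]$
$\qquad - \dfrac{G'(y)}{2}\left[\dfrac{F_v'(y)}{F_v(y)} + \dfrac{F_w'(y)}{F_w(y)}\right]$
$\qquad - \dfrac{G(y)}{2}\left[\dfrac{F_v''(y)}{F_v(y)} - \left(\dfrac{F_v'(y)}{F_v(y)}\right)^2 + \dfrac{F_w''(y)}{F_w(y)} - \left(\dfrac{F_w'(y)}{F_w(y)}\right)^2\right]$.

Let us set y = 0 and substitute equations 27 and 32 in equation 33,
obtaining

(34) $G''(0) = \dfrac{H''(0)}{\sqrt{F_v(0)\,F_w(0)}} - \dfrac{F_v''(0)}{2F_v(0)} - \dfrac{F_w''(0)}{2F_w(0)}$.

The terms in the denominator may be evaluated from equations 24 and 25.
The terms in the numerator, however, require the
derivatives of equations 29, 30, and 31. These derivatives are

(35) $H''(y) = 2a_v a_w + 6e_1 y - 2a_v a_w \bar r + e_2 \dfrac{6y^4 - 9y^2 + 2}{(1 - y^2)^{3/2}}$,

(36) $F_v''(y) = 2b_v + 6e_3 y - 2a_v^2 \bar r + e_4 \dfrac{6y^4 - 9y^2 + 2}{(1 - y^2)^{3/2}}$,

(37) $F_w''(y) = 2b_w + 6e_5 y - 2a_w^2 \bar r + e_6 \dfrac{6y^4 - 9y^2 + 2}{(1 - y^2)^{3/2}}$.

Setting y = 0, we have

(38) $H''(0) = 2a_v a_w - 2a_v a_w \bar r + 2e_2$,

(39) $F_v''(0) = 2b_v - 2a_v^2 \bar r + 2e_4$,

(40) $F_w''(0) = 2b_w - 2a_w^2 \bar r + 2e_6$.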


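Substituting equations 38, 39, and 40, as well as equations 24 and 25
in equation 34, we have

(41) $G''(0) = \dfrac{2a_v a_w - 2a_v a_w \bar r + 2e_2}{a_v a_w \bar r} - \dfrac{2b_v - 2a_v^2 \bar r + 2e_4}{2a_v^2 \bar r} - \dfrac{2b_w - 2a_w^2 \bar r + 2e_6}{2a_w^2 \bar r}$.

If we factor out $-1/\bar r$ in equation 41, and rearrange the terms, we have

(42) $G''(0) = -\left(\dfrac{1}{\bar r}\right)\left[\dfrac{b_v}{a_v^2} - 1 + \dfrac{b_w}{a_w^2} - 1 - \dfrac{2e_2}{a_v a_w} + \dfrac{e_4}{a_v^2} + \dfrac{e_6}{a_w^2}\right]$.

From the gross score formula for variance, we see that

(43) $\sigma_v^2 = b_v - a_v^2$

and

(44) $\sigma_w^2 = b_w - a_w^2$.

Substituting equations 43 and 44 in equation 42, we have

(45) $G''(0) = -\left(\dfrac{1}{\bar r}\right)\left[\dfrac{\sigma_v^2}{a_v^2} + \dfrac{\sigma_w^2}{a_w^2} - \dfrac{2e_2}{a_v a_w} + \dfrac{e_4}{a_v^2} + \dfrac{e_6}{a_w^2}\right]$.

Substituting equations 27, 32, and 45 in equation 26 and setting
$y = 1/\sqrt{K}$, we have

(46) $R_{x_v x_w} = 1 - \dfrac{1}{2K\bar r}\left[\dfrac{\sigma_v^2}{a_v^2} + \dfrac{\sigma_w^2}{a_w^2} - \dfrac{2e_2}{a_v a_w} + \dfrac{e_4}{a_v^2} + \dfrac{e_6}{a_w^2}\right]$.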


If we consider the expectation of R, the variables designated by $e_i$
will vanish, since each of these variables was defined in such a way as
to have a mean of zero and a constant standard deviation. Designating
the mean value of R by $\bar R$, we have the final formula given by Wilks
(1938), page 26:

(47) $\bar R = 1 - \dfrac{1}{2K\bar r}\left[\left(\dfrac{\sigma_v}{a_v}\right)^2 + \left(\dfrac{\sigma_w}{a_w}\right)^2\right]$,

where $\bar R$ is an approximation to the mean value of the correlation
between two weighted composites,
$\bar r$ is the average intercorrelation between the variables being
combined,
K is the number of variables being combined,
$\sigma_v$ and $\sigma_w$ are the standard deviations of the two populations of
weights being considered, and
$a_v$ and $a_w$ are the means of the two populations of weights being
considered.

In the absence of information on the mean and the standard deviation
of the population of weights being sampled, the values for the sample
(the means and standard deviations of V and W) may be used instead.
The variance of R is given by terms of the order

$$(R - \bar R)^2 = \frac{1}{4\bar r^2 K^2}\left[\frac{2e_2}{a_v a_w} - \frac{e_4}{a_v^2} - \frac{e_6}{a_w^2}\right]^2.$$

Since each term $e_i$ has a constant variance independent of K, we see
that the variance of R is of the order $(1/K)^2$ so that the individual R
terms will vary from $\bar R$ by terms of the order 1/K.


Equation 47 may be used as an approximation of the correlation
between two weighted composites. It should be noted
that the equation does not apply if (a) the average intercorrelation
of the variables ($\bar r$) is near zero or is negative, or (b)
negative and positive weights are used so that a is near zero
and small in relation to $\sigma$, or (c) the correlation between the
two sets of weights ($r_{vw}$) is negative. Under any one or more
of these conditions, a more general equation such as equation 15
must be used.


For example, equation 47 indicates that, if the quantity $2K\bar r$ is thirty
or larger, there is no point to bothering with different weighting systems, 
unless we are prepared to consider negatively correlated weights or 
weights some of which are positive and some negative. 
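
A minimal sketch of equation 47 as a function may make the preceding remark concrete. The function name, argument names, and the numerical values below are illustrative assumptions, not part of the original text; the example uses 50 items with an average intercorrelation of .30, so that the quantity 2K r-bar equals thirty, and two weight sets whose standard deviations are half as large as their means.

```python
def wilks_mean_correlation(k, r_bar, mean_v, sd_v, mean_w, sd_w):
    """Approximate mean correlation between two weighted composites (equation 47).

    k      : number of variables combined
    r_bar  : average intercorrelation of the variables (must be positive)
    mean_* : mean of each population of weights (must be nonzero)
    sd_*   : standard deviation of each population of weights
    """
    if k <= 0 or r_bar <= 0:
        raise ValueError("equation 47 assumes K > 0 and a positive average intercorrelation")
    if mean_v == 0 or mean_w == 0:
        raise ValueError("equation 47 assumes nonzero mean weights")
    cv_v = sd_v / mean_v  # relative spread of the V weights
    cv_w = sd_w / mean_w  # relative spread of the W weights
    return 1.0 - (cv_v ** 2 + cv_w ** 2) / (2.0 * k * r_bar)


# Hypothetical values chosen so that 2 * K * r_bar = 30.
approx = wilks_mean_correlation(k=50, r_bar=0.30,
                                mean_v=2.0, sd_v=1.0,
                                mean_w=2.0, sd_w=1.0)
print(round(approx, 4))  # about 0.98: the two composites are nearly interchangeable
```

Under these assumed values the expected correlation between the two composites is above .98, which is the sense in which alternative positive weighting systems make little difference when 2K r-bar is large.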

In arriving at equation 47, Wilks assumed that there was no probability
dependence between the V's and W's. There might or might
not be such dependence between r and V or r and W. He also assumed
that $\bar r$ was greater than zero, and that the number of $r_{gh}$ values greater
than zero was of the order of $K^2$. The use of generally positive weights
was also assumed, that is, $\bar V$ and $\bar W$ were assumed to be larger than
$\sigma_v$ and $\sigma_w$, respectively.


2. Predicting an external criterion by multiple correlation 

If an external criterion is available, and we desire to weight the sub- 
scores in such a manner that the composite score will have the highest 
possible correlation with the criterion, the method of multiple correlation
is the one to use. Again it must be remembered, as pointed out in the 
preceding section, that the precise method of weighting is not important 
unless we are dealing with relatively few tests that are not highly corre- 
lated with each other. 

We will present the proof, using calculus and the solution of linear 
equations by determinants. 

In the multiple correlation problem we have one criterion or de- 
pendent variable, which is to be approximated as closely as possible by a 
weighted sum of the independent variables. We may write 


(48) $\hat x_{i0} = b_1 x_{i1} + b_2 x_{i2} + \cdots + b_K x_{iK} = \sum_{g=1}^{K} b_g x_{ig}$,

where $\hat x_{i0}$ is the predicted criterion score of the ith individual,
$b_g$ (g = 1 ... K) is the weight assigned to the gth test, and
$x_{ig}$ (i = 1 ... N; g = 1 ... K) is the deviation score of the ith
individual on the gth test.

The multiple correlation problem is to choose the values of $b_g$ so that
the correlation of the criterion scores ($x_0$) with the predicted criterion
scores ($\hat x_0$) will be as large as possible. This is the same as making the
sum of the squares of the differences between $x_0$ and $\hat x_0$ as small as
possible. We may write

(49) $E = \sum_{i=1}^{N} (x_{i0} - \hat x_{i0})^2$,

where $x_{i0}$ is the criterion score of the ith individual, and E is the error
of prediction. The multiple correlation problem is to select the b's
that will minimize the value of E.
If we substitute equation 48 in equation 49, and set each of the derivatives
($\partial E/\partial b_g$) equal to zero, we have

(50)
$$\begin{aligned}
\Sigma x_0 x_1 - b_1 \Sigma x_1^2 - b_2 \Sigma x_1 x_2 - \cdots - b_K \Sigma x_1 x_K &= 0 \\
\Sigma x_0 x_2 - b_1 \Sigma x_1 x_2 - b_2 \Sigma x_2^2 - \cdots - b_K \Sigma x_2 x_K &= 0 \\
\vdots \qquad\qquad & \\
\Sigma x_0 x_K - b_1 \Sigma x_1 x_K - b_2 \Sigma x_2 x_K - \cdots - b_K \Sigma x_K^2 &= 0.
\end{aligned}$$


The equations 50 give the b’s in terms of the variances and covariances 
of the independent variables, and the covariances of the dependent with 
each of the independent variables. 

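The solution of equations 50 can be expressed in determinantal form.
Let the determinant

(51)
$$\Delta = \begin{vmatrix}
1 & r_{01} & r_{02} & r_{03} & \cdots & r_{0K} \\
r_{01} & 1 & r_{12} & r_{13} & \cdots & r_{1K} \\
r_{02} & r_{12} & 1 & r_{23} & \cdots & r_{2K} \\
r_{03} & r_{13} & r_{23} & 1 & \cdots & r_{3K} \\
\vdots & \vdots & \vdots & \vdots & & \vdots \\
r_{0K} & r_{1K} & r_{2K} & r_{3K} & \cdots & 1
\end{vmatrix}.$$

Let $\Delta_{00}$ be the determinant formed by deleting the first row and the
first column of $\Delta$, $\Delta_{01}$ be the determinant formed by deleting the first
row and second column of $\Delta$, $\Delta_{02}$ be the determinant formed by deleting
the first row and third column of $\Delta$, and in general let $\Delta_{0g}$ be the
determinant formed by deleting the first row and the (g + 1)-th column of $\Delta$.
Then the solution of equations 50 is given by

(52)
$$\begin{aligned}
b_{01.234 \ldots K} &= (-1)^{0}\,\frac{\Delta_{01}\, s_0}{\Delta_{00}\, s_1}, \\
b_{02.134 \ldots K} &= (-1)^{1}\,\frac{\Delta_{02}\, s_0}{\Delta_{00}\, s_2}, \\
b_{03.124 \ldots K} &= (-1)^{2}\,\frac{\Delta_{03}\, s_0}{\Delta_{00}\, s_3}, \\
&\;\;\vdots \\
b_{0K.123 \ldots (K-1)} &= (-1)^{K-1}\,\frac{\Delta_{0K}\, s_0}{\Delta_{00}\, s_K}.
\end{aligned}$$

In general, we may write

(53) $b_{0g.12 \ldots K} = (-1)^{g-1}\,\dfrac{\Delta_{0g}\, s_0}{\Delta_{00}\, s_g}$.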


When the multiple regression weights have been determined accurately
for a set of variables, it is well to remember the proof given in
the preceding section that two sets of highly correlated weights will give
highly correlated composites. This means that, instead of using the
awkward fractional weights indicated by the multiple correlation solution,
we may approximate them by a set of simple integral weights.

The weights indicated in equations 52 when used in equation 48 will
give the best estimate of $x_0$ in the sense that the error (E) indicated in
equation 49 will be a minimum. Dividing E by N, the number of cases,
and taking the square root gives the error of estimate, which may be
written

(54) $s_{0.12 \ldots K} = \sqrt{\dfrac{E}{N}} = s_0 \sqrt{\dfrac{\Delta}{\Delta_{00}}}$.

The weighted sum $\hat x_0$ given by equation 48, using the weights of equations
52, correlates higher with $x_0$ than any other possible weighted sum of
the independent variables $x_1 \cdots x_K$. This correlation is the multiple
correlation, and its value is given by

(55) $R_{0.12 \ldots K} = \sqrt{1 - \dfrac{\Delta}{\Delta_{00}}}$.
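
The determinantal solution can be sketched in a few lines of code. The following is an illustration under stated assumptions, not a recommended computing routine: the function names are hypothetical, the determinant is found by simple cofactor expansion (adequate only for small matrices), and the correlations and standard deviations in the example are made up. Index 0 denotes the criterion, as in equation 51.

```python
def det(m):
    """Determinant by cofactor expansion along the first row (small matrices only)."""
    n = len(m)
    if n == 1:
        return m[0][0]
    total = 0.0
    for j in range(n):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * m[0][j] * det(minor)
    return total


def minor_det(m, row, col):
    """Determinant of m after deleting the given row and column."""
    sub = [r[:col] + r[col + 1:] for i, r in enumerate(m) if i != row]
    return det(sub)


def regression_from_correlations(corr, sds):
    """Gross-score weights (eq. 53), error of estimate (eq. 54), multiple R (eq. 55).

    corr : (K+1) x (K+1) correlation matrix, row/column 0 = criterion
    sds  : standard deviations s_0 ... s_K
    """
    k = len(corr) - 1
    d = det(corr)
    d00 = minor_det(corr, 0, 0)
    weights = []
    for g in range(1, k + 1):
        d0g = minor_det(corr, 0, g)
        weights.append((-1) ** (g - 1) * d0g * sds[0] / (d00 * sds[g]))
    error_of_estimate = sds[0] * (d / d00) ** 0.5
    multiple_r = (1.0 - d / d00) ** 0.5
    return weights, error_of_estimate, multiple_r


# Hypothetical example: one criterion and two predictors.
corr = [[1.0, 0.5, 0.4],
        [0.5, 1.0, 0.3],
        [0.4, 0.3, 1.0]]
sds = [10.0, 5.0, 2.0]
weights, error_of_estimate, multiple_r = regression_from_correlations(corr, sds)
print(weights, error_of_estimate, multiple_r)
```

For more than a handful of tests, cofactor expansion becomes very slow, and a standard linear-equation solver would be the practical choice.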


For the simplest possible case of multiple correlation, that of predicting
one criterion ($x_0$) from two independent variables ($x_1$ and $x_2$), these
equations may be written very much more simply. Equations 52 for
the weights of $x_1$ and $x_2$ become

(56)
$$b_{01.2} = \frac{s_0 (r_{01} - r_{02} r_{12})}{s_1 (1 - r_{12}^2)}, \qquad
b_{02.1} = \frac{s_0 (r_{02} - r_{01} r_{12})}{s_2 (1 - r_{12}^2)}.$$

The error of estimate of equation 54 becomes

(57) $s_{0.12} = s_0 \sqrt{\dfrac{1 + 2r_{01} r_{02} r_{12} - r_{01}^2 - r_{02}^2 - r_{12}^2}{1 - r_{12}^2}}$.

The multiple correlation given in equation 55 becomes

(58) $R_{0.12} = \sqrt{\dfrac{r_{01}^2 + r_{02}^2 - 2r_{01} r_{02} r_{12}}{1 - r_{12}^2}}$.

The equations 56 to 58 give the weights, error of estimate, and multiple
correlation for the three-variable case in terms of the three intercorrelations
and three standard deviations of the original variables.
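
The three-variable formulas lend themselves to a direct numerical check. The sketch below is illustrative only: the function names and all numerical values are assumptions made for the example. It evaluates equations 56 to 58 and then compares the multiple correlation with the validity of a composite that uses the simple integral weights 1 and 2 in place of the fractional regression weights.

```python
from math import sqrt


def two_predictor_solution(r01, r02, r12, s0, s1, s2):
    """Weights (eq. 56), error of estimate (eq. 57), multiple R (eq. 58)."""
    denom = 1.0 - r12 ** 2
    b1 = s0 * (r01 - r02 * r12) / (s1 * denom)
    b2 = s0 * (r02 - r01 * r12) / (s2 * denom)
    error = s0 * sqrt((1.0 + 2.0 * r01 * r02 * r12
                       - r01 ** 2 - r02 ** 2 - r12 ** 2) / denom)
    multiple_r = sqrt((r01 ** 2 + r02 ** 2 - 2.0 * r01 * r02 * r12) / denom)
    return b1, b2, error, multiple_r


def composite_validity(w1, w2, r01, r02, r12, s1, s2):
    """Correlation of the gross-score composite w1*x1 + w2*x2 with the criterion."""
    covariance_with_criterion = w1 * s1 * r01 + w2 * s2 * r02   # in units of s0
    composite_sd = sqrt((w1 * s1) ** 2 + (w2 * s2) ** 2
                        + 2.0 * w1 * w2 * s1 * s2 * r12)
    return covariance_with_criterion / composite_sd


# Hypothetical correlations and standard deviations.
r01, r02, r12 = 0.5, 0.4, 0.3
s0, s1, s2 = 10.0, 5.0, 2.0
b1, b2, error, multiple_r = two_predictor_solution(r01, r02, r12, s0, s1, s2)

# Regression weights are roughly 0.84 and 1.37; try the integers 1 and 2 instead.
r_integral = composite_validity(1, 2, r01, r02, r12, s1, s2)
print(round(multiple_r, 4), round(r_integral, 4))
```

In this assumed case the integral weights lose only about one point in the third decimal place of the correlation with the criterion.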

It should be noted that it may readily happen in multiple correlation 
methods that a given test is assigned a negative weight. This means 
that the better a person does on that test, the poorer will be his composite 
score, and vice versa. Such weights should lead to a careful scrutiny 
of the test and a consideration of the reasonableness of such a finding. 
Many situations arise where a negative weight is plausible. However, 
it should be noted that in a test which is to be given repeatedly, and for 
which the scoring method may become known to the candidates, it 
would be very unwise to retain a test with negative weight, since it is 
very easy for the subjects who know the scoring method to attempt to 
obtain, and succeed in getting, a low score. Such a change in motivating 
conditions would destroy any predictive value that the test might have 
had previously when the subjects were all attempting to obtain a high 
score. Adkins et al. (1947), page 170, indicate that negative weights
are not used in civil service tests.


In dealing with more than three variables, it is necessary to use special
computational methods, such as those described in Guilford (1936b),
pages 390-404.


If a criterion is available, multiple correlation methods give
the best weights for predicting that criterion. Simple integral
approximations to these weights will usually give a composite
score that correlates almost as well with the criterion.

3. Selecting tests for a battery by approximations to multiple correlation

In addition to specifying the best set of weights to use for each of
the tests in a battery, it is frequently desirable to eliminate some of
the tests as well. For example, we might use six subtests in an experimental
battery, and wish to know which three would give the highest
multiple correlation. It would also be desirable to know what the multiple
correlation was for all six tests, as well as for the best set of three,
in order to see how much was lost in predictive accuracy by eliminating
the poorest half of the tests.

The only certain method of obtaining an exact answer to such questions
is to work the zero-order correlations, the multiple correlations for
all possible combinations of two tests, for all possible combinations of
three, or four, or five, and the multiple correlation using all six tests.
This would mean, for six predictor tests and one criterion, computing
15 + 20 + 15 + 6 + 1, or 57 multiple correlations. We could then
easily pick the best combination of tests at each stage and decide
whether the additional testing time was adequately repaid in terms of
higher validity coefficients. With the ordinary computing methods in
use at present, the labor of such computations makes such analyses
prohibitive. It may well be that, with the development of high-speed
electronic computing machines, the exact solution of such a problem
would be more economical than many of the approximation methods now
in use.
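
The count of 57 can be verified by enumerating the number of subsets of each size. This is a small sketch; the variable names are illustrative.

```python
from math import comb

k = 6  # number of predictor tests

# Number of distinct multiple correlations for each subset size from 2 to K.
counts = {size: comb(k, size) for size in range(2, k + 1)}
total_multiples = sum(counts.values())

print(counts)           # subsets of size 2, 3, 4, 5, 6
print(total_multiples)  # 57
```

The total grows as 2^K - K - 1, which is why exhaustive search becomes laborious so quickly as tests are added.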

Frisch (1934) described a method of dealing with what he termed 
“complete regression systems” by “confluence analysis.” This method 
essentially involved computing multiple correlations and multiple 
regression weights for all combinations of the variables involved in 
order to understand thoroughly the relationships among these variables. 

One very good approximation method is to look first at the zero-order
correlation coefficients, and select the one best test. This test is
then tried out with each of the K - 1 remaining tests to see which two
(including the one best) will give the highest multiple. These two best
are then combined in turn with each of the K - 2 remaining tests to
pick the “best” combination of three. With this method we should
select three tests out of a set of six by working multiples for only 5
two-test composites and 4 three-test composites, that is, 9 instead of 57
multiple correlations. Such a method has been described by Toops
(1923). Other closely similar procedures have been described by Wherry
(see Stead, Shartle, and associates, 1940, Appendix V), by Toops (1941),
by Wherry and Gaylord (1946), and by Horst (1934b).
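
The step-by-step selection just described can be sketched as follows. This is a plain illustration of the general idea, not a reproduction of the specific procedures of Toops, Wherry, or Horst; the function names and the correlation values (six tests with equal intercorrelations of .30 and validities from .50 down to .25) are assumptions made for the example.

```python
def det(m):
    """Determinant by cofactor expansion along the first row (small matrices only)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))


def multiple_r(corr, predictors):
    """Multiple correlation of the criterion (index 0) with the listed predictors."""
    idx = [0] + list(predictors)
    full = [[corr[i][j] for j in idx] for i in idx]
    inner = [row[1:] for row in full[1:]]
    return (1.0 - det(full) / det(inner)) ** 0.5


def forward_select(corr, n_keep):
    """Pick the best single test, then add one test at a time."""
    k = len(corr) - 1
    remaining = list(range(1, k + 1))
    # Step 1 needs only the zero-order validities.
    best = max(remaining, key=lambda g: abs(corr[0][g]))
    chosen = [best]
    remaining.remove(best)
    multiples_computed = 0
    while len(chosen) < n_keep:
        scored = []
        for g in remaining:
            scored.append((multiple_r(corr, chosen + [g]), g))
            multiples_computed += 1
        _, best = max(scored)
        chosen.append(best)
        remaining.remove(best)
    return chosen, multiples_computed


# Hypothetical battery: six tests, all intercorrelations .30.
validities = [0.50, 0.45, 0.40, 0.35, 0.30, 0.25]
k = len(validities)
corr = [[1.0] + validities]
for g in range(k):
    corr.append([validities[g]] + [1.0 if h == g else 0.30 for h in range(k)])

chosen, multiples_computed = forward_select(corr, 3)
print(chosen, multiples_computed)
```

With these assumed correlations the procedure selects tests 1, 2, and 3 after working 5 + 4 = 9 multiples. It carries no guarantee that the set chosen is the best of all twenty possible sets of three.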

If we are willing to assume that the best set of two tests includes the
best one, the best set of three includes the two previously indicated,
and so on, and in addition to assume that the relative weights determined
for the two also hold when these two are combined with a third, and
so on up, a very quick and easy graphic approximation method has been
provided by Jenkins (1946).
4. Weighting according to test reliability or inversely as the error variance

Giving the more reliable tests greater weight in a composite has been
suggested by Kelley (1927), pages 211-213, Thurstone (1931a), pages
88-90, and Richardson [see Appendix D, pages 392-396, in Horst
(1941)]. Kelley and Richardson give the gross score weight of any test
(g) as $r_{gg}/(1 - r_{gg})$, that is, the weight is the ratio of the reliability
coefficient to the error variance of the standard scores. Thurstone
follows a slightly different procedure and finds the gross score weights to
be $\sqrt{r_{gg}}/(1 - r_{gg})$. The former formula can readily be derived from
multiple correlation theory with two assumptions. The first is that all
the tests are measures of the same true score, except for the fact that
they contain different proportions of random error. This assumption
means that the intercorrelations will be unity, when corrected for
attenuation. The second assumption is that we wish to maximize the
correlation between this common true score and the weighted composite.
Stated in mathematical terms, these assumptions mean that


(59) $r_{gh} = \sqrt{r_{gg} r_{hh}}$ (g ≠ h = 1 ... K),

that is, the intercorrelation between any two tests is equal to the geometric
mean of the reliability coefficients. The criterion may be assumed to
be the true score, in which case the validity coefficient of each test is
given by

(60) $r_{tg} = \sqrt{r_{gg}}$,

or it may be assumed that the criterion is another test $x_0$, which also
has the same true score, in which case the validity coefficient of each
test is given by

(61) $r_{0g} = \sqrt{r_{00} r_{gg}}$.

Identical relative weights are given by either assumption. In the
following derivation we shall use equations 59 and 61. We see that the
criterion reliability ($r_{00}$) is a factor common to all the weights, hence
may be ignored if we are interested only in relative weights. Substituting
equations 59 and 61 in equation 51, we have

(62)
$$\Delta = \begin{vmatrix}
1 & \sqrt{r_{00} r_{11}} & \sqrt{r_{00} r_{22}} & \cdots & \sqrt{r_{00} r_{KK}} \\
\sqrt{r_{00} r_{11}} & 1 & \sqrt{r_{11} r_{22}} & \cdots & \sqrt{r_{11} r_{KK}} \\
\sqrt{r_{00} r_{22}} & \sqrt{r_{11} r_{22}} & 1 & \cdots & \sqrt{r_{22} r_{KK}} \\
\vdots & \vdots & \vdots & & \vdots \\
\sqrt{r_{00} r_{KK}} & \sqrt{r_{11} r_{KK}} & \sqrt{r_{22} r_{KK}} & \cdots & 1
\end{vmatrix}.$$

Since the factor $s_0/\Delta_{00}$ is common to all the weights in equations 52,
we may ignore it and evaluate terms of the form $\Delta_{0g}/s_g$ to determine
the relative weights for the variables. We may form $\Delta_{01}$ by deleting
the first row and second column of equation 62 and then transform the
determinant $\Delta_{01}$ by multiplying the first column by $\sqrt{r_{22}}/\sqrt{r_{00}}$ and
deducting the product from the second column. In general multiply
the first column by $\sqrt{r_{gg}}/\sqrt{r_{00}}$ and deduct the product from column g.
These transformations do not alter the value of the determinant, so
we have

(63)
$$\Delta_{01} = \begin{vmatrix}
\sqrt{r_{00} r_{11}} & 0 & 0 & \cdots & 0 \\
\sqrt{r_{00} r_{22}} & 1 - r_{22} & 0 & \cdots & 0 \\
\sqrt{r_{00} r_{33}} & 0 & 1 - r_{33} & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
\sqrt{r_{00} r_{KK}} & 0 & 0 & \cdots & 1 - r_{KK}
\end{vmatrix}.$$

Expanding equation 63 in terms of minors of the first row, we have

(64) $\Delta_{01} = \sqrt{r_{00} r_{11}}\,(1 - r_{22})(1 - r_{33}) \cdots (1 - r_{KK})$.

If we multiply and divide the right side of equation 64 by $(1 - r_{11})$
and let

$$P = \sqrt{r_{00}}\,(1 - r_{11})(1 - r_{22})(1 - r_{33}) \cdots (1 - r_{KK}),$$

we have

$$\Delta_{01} = \frac{P\sqrt{r_{11}}}{1 - r_{11}}.$$

We may write all the other terms of the form $\Delta_{0g}$ similarly, omitting
the common factor P in order to deal only with relative weights. If
the factors common to all the weights are designated by C, and the
standard score weight for test g is indicated by $\beta_{0g.12 \ldots K}$, we have
equations of the form

(65) $C\beta_{0g.12 \ldots K} = \dfrac{\sqrt{r_{gg}}}{1 - r_{gg}}$.
From equations 53 of this chapter and 20 of Chapter 3, we obtain
the weights appropriate for use with gross scores. These weights,
designated by $b_{0g.12 \ldots K}$, are

(66) $C' b_{0g.12 \ldots K} = \dfrac{r_{gg}}{1 - r_{gg}}$ (g = 1 ... K),

where C' designates the factors common to all the weights. Weighting
formulas 65 and 66 were presented by Kelley (1927), pages 211-213.
The detailed derivation has been given by Richardson in Appendix D,
see Horst (1941, pages 392-396). Thurstone (1931a) has suggested the
use of weights dependent on reliability, which differ from these by a
factor of $\sqrt{r_{gg}}$.
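
The standard score weights of equation 65 can be checked numerically against the assumptions of equations 59 and 61. The sketch below is an illustration only: the function names and the reliabilities are assumed values. It builds the intercorrelations and validities implied by the two assumptions and confirms that weights proportional to the square root of the reliability divided by one minus the reliability satisfy the normal equations, up to a factor common to all tests.

```python
from math import sqrt


def standard_score_weights(reliabilities):
    """Relative standard-score weights sqrt(r_gg) / (1 - r_gg), as in equation 65."""
    return [sqrt(r) / (1.0 - r) for r in reliabilities]


def kelley_gross_score_weights(reliabilities):
    """Relative gross-score weights r_gg / (1 - r_gg), the form attributed in the
    text to Kelley and Richardson."""
    return [r / (1.0 - r) for r in reliabilities]


# Hypothetical reliabilities of three tests and of the criterion test.
reliabilities = [0.90, 0.80, 0.60]
r00 = 0.70
k = len(reliabilities)

# Intercorrelations under equation 59 and validities under equation 61.
intercorr = [[1.0 if g == h else sqrt(reliabilities[g] * reliabilities[h])
              for h in range(k)] for g in range(k)]
validity = [sqrt(r00 * r) for r in reliabilities]

beta = standard_score_weights(reliabilities)

# Normal equations in standard-score form: sum_h r_gh * beta_h = c * r_0g.
# If beta has the right relative sizes, the ratio c is the same for every test.
ratios = [sum(intercorr[g][h] * beta[h] for h in range(k)) / validity[g]
          for g in range(k)]
print(beta)
print(ratios)
```

The equality of the ratios holds only because the intercorrelations were constructed to satisfy equation 59; with observed intercorrelations it would generally fail.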
The use of the weights of equation 66 depends upon the assumptions
indicated in equations 59 and 61. Whenever several tests and their
reliabilities are available so that the weights of equation 66 may be
used, it is also always possible to calculate the intercorrelations among
these tests in order to verify the assumption in equation 59. It is
probably only in exceptional cases that this assumption would be
verified. Constructing a set of tests that satisfies this assumption
is a fairly difficult job. Equation 59 is a far more stringent requirement
than a single factor battery. Tests may have one common factor, and
differ both in error and in factors specific to each test. Equation 59
does not allow for any possibility of a factor specific to each test.

In summary, we may say that, while it is usually desirable 
to give greater weight to the more reliable test, there is usually 


no special justification for the particular weights indicated by 
equations 66. 


5. Weighting inversely as the standard deviation

Weighting gross scores in tests by the reciprocal of the standard
deviation has also been frequently given as one method of combining
tests. See Kelley (1927), page 66, Thurstone (1931b), pages 83-87, and
others. It should be noted that such a weighting principle is justified
only in highly specialized and unusual cases. For example, if the true
variance of the group tested is large, this will contribute to making the
standard deviation of the test large, and would seem to be no valid
reason for decreasing the weight of the test. On the other hand, test
score variance may be increased by increasing the error variance of a
test, in which case a decreased weight for the test is plausible;
... deviation of a test is its length. A 100-item test will have a much larger
standard deviation than a 10-item test. Clearly on a common-sense
basis, we should not increase the accuracy of a test by lengthening it
from 10 to 100 items, and then reduce the weight of the better test by
weighting it inversely as its standard deviation. A detailed criticism
of the method of weighting inversely as the standard deviations is given
by Richardson. See Horst (1941), Appendix D, pages 385-388.

It is interesting to note that under certain highly specialized conditions 
the multiple correlation weights of equation 53 become equal to the 
reciprocal of the standard deviation of the test. If it is assumed that 
all the test intercorrelations are equal to some value, r, for instance, 
and that all the validity coefficients (rog) are equal to some value, for 
example, v, then $\Delta_{01} = \Delta_{02} = \cdots = \Delta_{0K}$, so that all the weights
indicated in equations 52 are identical except for the standard deviation
appearing in the denominator. In other words, if all the independent 
variables in a set are identical with respect to validity, and have identical 
intercorrelations, so that no special test clusters are formed, and if in 
spite of such remarkable similarity the tests still differ in variability, 
the multiple correlation weights are inversely proportional to the standard 
deviations of the tests. 

It should also be noted that various methods of scoring a test may
also have an effect on its standard deviation. For example, if two
100-item tests are scored differently, test a receiving one point per item and
test b ten points per item, then if the tests are reasonably similar the
standard deviation of b and its influence in any composite would be
very much greater than that of a. Similarly, tests scored number right
will have a different standard deviation if the scoring system is changed
to R — oW.

The standard deviation of a test is an important factor in determining
the influence of that test on any composite. However, it is not possible
to set up any sensible routine method for using the standard deviation
in determining the weight of a test. If one test has a larger standard
deviation than another test, and this difference seems to be due to factors
that are largely irrelevant to the reliability and validity of the test,
weighting inversely as the standard deviation is probably reasonable.
If the test with the larger standard deviation is more valid or reliable,
or if it seems to be reasonable to assume that it would be more valid
and reliable because it is a longer or a better test, then simply adding
in the gross scores of the two tests would be a reasonable procedure,
and weighting inversely as the standard deviations would only help
to decrease the influence of the best test. On the other hand, if it seems
that the test with the larger standard deviation owes this extra
variability to error variance, some still different weighting scheme that would
decrease the weight of or eliminate the poorer test would seem reasonable.


Weighting inversely as the standard deviation is to be avoided 
as a routine procedure. Other factors being approximately 
equal, a composite will be influenced more by a test with a 
large standard deviation than by one with a small standard 
deviation. This higher weight is probably desirable if the 
greater standard deviation is due to such factors as greater 
test length or reliability that contribute to true variance. The 
higher weight is probably undesirable if it seems that the 
greater standard deviation is due to irrelevant multiplying 


factors in the scoring key or to any factors that would increase 
the error variance. 


6. Weighting inversely as the error of measurement 


Weighting each test by the factor $1/(s_g\sqrt{1 - r_{gg}})$ is a method that
would be free of the obvious objections that apply to weighting either 
inversely as the standard deviations or inversely as the square of the 
error of measurement. For example, such a weighting would auto- 
matically correct for any arbitrary change in scoring that affected the 
standard deviation of the test without altering its reliability. As the 
true variance increased, or the error variance decreased, there would be 
an appropriate direction of adjustment of the weights. If the test length 
were altered so as to raise the reliability, the weight would be increased. 
As an arbitrary rule of thumb method for use when no criterion is 
available and the tests seem indifferent as far as judgment of content is 
concerned, it would seem that such a system would be appropriate. 
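
The two properties claimed for this weight can be seen in a short numerical sketch. The function name and all numbers are assumptions for illustration: the same test is scored once at one point per item and once at ten points per item, and a second version of the test is given a higher reliability.

```python
from math import sqrt, isclose


def error_of_measurement_weight(sd, reliability):
    """Weight 1 / (s_g * sqrt(1 - r_gg)): the reciprocal of the error of measurement."""
    if not 0.0 <= reliability < 1.0:
        raise ValueError("reliability must be at least 0 and less than 1")
    return 1.0 / (sd * sqrt(1.0 - reliability))


# Same hypothetical test under two scoring keys: ten points per item
# multiplies every score, and hence the standard deviation, by ten.
w_one_point = error_of_measurement_weight(sd=8.0, reliability=0.84)
w_ten_point = error_of_measurement_weight(sd=80.0, reliability=0.84)

raw_score = 37.0
contribution_one_point = w_one_point * raw_score
contribution_ten_point = w_ten_point * (10.0 * raw_score)

# Raising the reliability (for instance by lengthening the test) raises the weight.
w_more_reliable = error_of_measurement_weight(sd=8.0, reliability=0.91)

print(contribution_one_point, contribution_ten_point, w_more_reliable)
```

The weighted contribution of a person's score is the same under either scoring key, so an arbitrary multiplying factor in the key has no effect on the composite.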


Weighting inversely as the error of measurement automatically
corrects for any arbitrary multiplying factors introduced
in the scoring system, incre... ...d for this method, it has excellent
... common-sense point of view, and is probably ... rule of thumb
method to recommend when no criterion is available and when test
intercorrelations are not computed.


In most amateur discussions of weighting, the factors
considered are the number of items and the magnitude
of the score. It is believed, for example, that if gross scores are added,
the effect will be to give a 100-item test twice the weight of a 50-item 
test. That such is not the case can be seen for example by assuming that 
the 100-item test was a very easy one on which everyone obtained scores 
ranging from 95 to 100. Adding scores on this test to a student's record 
would then, at the most, make a 5-point difference in the total score. 

If, on the other hand, the 50-item test were composed of fairly difficult 
items and were fairly reliable, it could easily be that scores on it would 
range from 20 to 50. In other words, adding this test would make a 
30-point difference in extreme cases, and a 10- or 20-point difference in 
the majority of cases, so that the total score would agree rather closely 
with the score on the 50-item test and not correlate with the score on 
the 100-item test. 

From the illustration just given, we also see that the weight a test
exerts is not related to the magnitude of the average score either. The 
100-item test in the illustration would have a mean of 97 or 98 correct 
answers, and the 50-item test would have a mean in the 30's. The initial 
amateur reaction on seeing two sets of test scores, one set mostly in the 
30’s and the other all in the high 90’s, would be to feel that, if the two 
sets were added, the first would have approximately one-third the weight 
of the second. As we have just seen, the total range of scores for a test 
is the important factor in determining its effect on a composite. 

It can be seen that, of themselves, the test mean, and number of items 
have no effect whatever on the relationship between a test and the 
composite of which it is a part. Both factors should be completely ignored 
in considering weighting problems. 
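
The illustration can be reproduced with a handful of made-up scores. In the sketch below the six pairs of scores, the variable names, and the helper function are all assumptions for the example: a 100-item test with scores between 95 and 100 is added to a 50-item test with scores between 20 and 50.

```python
from math import sqrt


def pearson(xs, ys):
    """Product-moment correlation between two equally long lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical scores for six students.
easy_100_item = [95, 100, 97, 99, 96, 98]   # high mean, small spread
hard_50_item = [20, 26, 32, 38, 44, 50]     # lower mean, large spread
total = [a + b for a, b in zip(easy_100_item, hard_50_item)]

r_easy = pearson(easy_100_item, total)
r_hard = pearson(hard_50_item, total)

# Adding a constant to every score on a test changes its mean but not
# its relation to the composite.
hard_shifted = [score + 40 for score in hard_50_item]
total_shifted = [a + b for a, b in zip(easy_100_item, hard_shifted)]
r_hard_shifted = pearson(hard_shifted, total_shifted)

print(round(r_easy, 3), round(r_hard, 3), round(r_hard_shifted, 3))
```

With these assumed scores the total correlates about .99 with the shorter test and about .30 with the longer one, and shifting the mean of the shorter test by 40 points leaves its correlation with the total unchanged.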

It might be noted that, if all students do not have the same series of 
tests, the mean score is an important factor. Suppose, for example, 
that the students have the choice of answering questions X or Y, or of 
submitting answers on the X-test or, alternatively, on the Y-test. If 
the X-scores range from 30 to 50 with an average around 40, and the
Y-scores range from 70 to 90 with an average around 80, clearly the
students who have chosen Y and not X will get in general 40 more points
in their total than those who have chosen X and not Y. In such a case
it is possible to “adjust” by adding 40 points to each person’s X-score
or subtracting 40 points from each person's Y-score. However, another
complication arises here. Can it correctly be assumed that the students
who submitted X are on the average identical with those who submitted
Y? Frequently when alternative choices are given it happens that 
better students tend to pick one and poorer students the other, so that 
equating the average scores is not an appropriate procedure. Neither 
is it correct simply to add the gross scores, which means assuming that 




these correctly represent the difference between the two groups. In 
general, it is impossible to determine the appropriate adjustment 
without an inordinate amount of effort. Alternative questions should 
always be avoided. The only possible rational solution is in the type 
of methods suggested in the chapter on standardizing tests. A common 
section must be used as the basis for equating the alternate parts. In 
order to use these equating procedures, both the common parts and the 
alternate parts need to be of a reasonable length to secure reliability, 
so that the equating will be reasonably stable for similar groups. In 
the conventional examination, where the student is asked to answer
any six of nine questions, we are really setting 84 different examinations,
one for each possible combination of six questions.
If the examination were given to 8,400 students, there would be
only 100 per examination on the average, which would mean that many
of the combinations would be taken by very few students. The data
would be inadequate for equating, and the labor would be great.
Alternative questions should in general not be used. If some choice
seems unavoidable, the choice should be set up systematically by requiring
a given set of items to be answered by everyone, and then reducing
the number of possible combinations that can be submitted by requiring
that the student answer one question from this set, or that he answer one
of the following three sets of questions.


The number of items in a test (the perfect score) and the test
mean have no effect in determining the test's influence in a
composite and should be ignored when considering the appropriate
weight for a test. The only exception arises when
alternative questions (or tests) are used, in which case we
must allow not only for the test mean but also for ability
differences in the groups making the different choices.


8. Effect of a subtest on a composite score 

Having considered the effects of alternate sets of weights used on the 
same set of subtests, we may turn to the problem of the effect of a subtest 
on a composite score. 

... each of the parts to the total. In general, such a solution is impossible, 
and will not be given here. 

A simple, direct, and meaningful way to think of the contribution of a 
part to the total is to use the correlation between the part and the total 
as an index of the contribution of the part. Wilks (1938) has suggested 
this method, pointing out that, if each part has the same correlation 
with the total, in one sense each part has the same weight in determining 
total score. It should be noted that using the correlation coefficient as 
an index enables us to define the “same” weight of two tests and to 
define “greater” weight and “less” weight. However, it is not possible 
to say that one test has two or three times the weight of another test. 

Using the part-total correlation as an index of relative weights, we 
are able to speak in terms of equal, greater, or less weight. It does not 
enable us to divide the total into a given number of parts, one for each 
subtest, totaling to 100 per cent; nor does it enable us to speak of 
double or triple weights. However, of the various methods that have 
been proposed of assessing the relationship or the “contribution” of the 
part to the total, it is the most generally useful and intelligible. 

The correlation between any part $X_g$ and the total $X_C$, which is a 
weighted sum of the parts, may be expressed as follows. Let 

(67)    $X_{iC} = W_1 X_{i1} + W_2 X_{i2} + \cdots + W_K X_{iK} = \sum_{g=1}^{K} W_g X_{ig}$. 

The composite score ($X_{iC}$) for individual $i$ is the weighted sum of his 
scores on the individual tests. If the X's are regarded as deviation 
scores, we may write the correlation of part $X_g$ with $X_C$ as follows: 

(68)    $r_{gC} = \dfrac{\sum_{i=1}^{N} X_{ig} X_{iC}}{N s_g s_C}$. 

Expanding and summing the terms in $X_{iC}$, we have 

(69)    $r_{gC} = \dfrac{\sum_{i=1}^{N} \sum_{h=1}^{K} W_h X_{ig} X_{ih}}{N s_g s_C}$. 

Reversing the order of summation and writing $\sum_{i=1}^{N} X_{ig} X_{ih}$ as a covariance 
or a variance term gives 

(70)    $r_{gC} = \dfrac{\sum_{h=1}^{K} W_h r_{gh} s_g s_h}{s_g s_C}$    $(r_{gg} = 1)$. 

If we separate the one variance term from the $K - 1$ covariance terms, 
and divide numerator and denominator by $s_g$, we have 

(71)    $r_{gC} = \dfrac{W_g s_g + \sum_{h=1}^{K} W_h r_{gh} s_h}{s_C}$    $(h \neq g)$, 

where $r_{gC}$ is the correlation between test g and the weighted composite, 
$W_g$ (or $W_h$) is the weight assigned to any test, 
$s_g$ (or $s_h$) is the standard deviation of the test, 
$r_{gh}$ is the intercorrelation between two tests, and 
$s_C$ is the standard deviation of the composite. 


Since $s_C$ is identical for each of the tests, it may be ignored. We 
see then that the correlation between the composite and any test is 
determined by the weight assigned that test, the standard deviation of 
that test, and the weighted sum of the correlations between that test 
and each of the other tests. 
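
Equation 71 can be evaluated directly once the weights, standard deviations, and intercorrelations are known. The following Python sketch is one way to do so; the function name and the two-test example are assumptions made for illustration, and the composite standard deviation is obtained by summing all the weighted variance and covariance terms.

```python
import math

def part_composite_correlations(weights, sds, corr):
    """Correlation of each part g with the weighted composite C (equation 71)."""
    k = len(weights)
    # Composite variance: sum over all pairs of W_g W_h r_gh s_g s_h.
    var_c = sum(weights[g] * weights[h] * corr[g][h] * sds[g] * sds[h]
                for g in range(k) for h in range(k))
    s_c = math.sqrt(var_c)
    result = []
    for g in range(k):
        own = weights[g] * sds[g]  # the one variance term (h = g)
        others = sum(weights[h] * corr[g][h] * sds[h]
                     for h in range(k) if h != g)  # the K - 1 covariance terms
        result.append((own + others) / s_c)
    return result

# Hypothetical case: two uncorrelated tests with equal nominal weights.
r_parts = part_composite_correlations(
    weights=[1.0, 1.0], sds=[3.0, 4.0], corr=[[1.0, 0.0], [0.0, 1.0]])
print(r_parts)  # [0.6, 0.8]
```

Although the nominal weights are equal, the test with the larger standard deviation correlates more highly with the composite. Neither the test means nor the numbers of items enter the computation.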

In particular it should be noted that the test mean $M_X$ and the 
number of items K (that is, the perfect score) have nothing whatever 
to do with determining the correlation of any test with the total composite. 
This correlation is determined by the test standard deviations, 
the intercorrelations, and the weights assigned. If the weighting factor 
for a test is increased, the correlation of that test with the composite 
will be increased. If the average correlation of a test with the other 
tests is increased, its correlation with the composite will be increased. 
If the standard deviation of a test is increased, its correlation with the 
composite will be increased. 

If all the tests in a set have zero intercorrelations, the correlation of 
any test with the composite will be proportional to the product of its 
assigned weight and its standard deviation. However, in the usual case 
this term will be small in proportion to the weighted sum of the correlations 
of that test with all the other tests. 

To show that the factors of test ... crucial and must be considered ... 
in Stuit (1947), pages 305-306. ... students’ time was divided ... sevenths 
in shop work, ... and two-sevenths classroom work in mathematics, ... 
The final grades were ... sevenths spent in shop work. ... mathematics 
grades was 4.1, and of shop work 2.5. It is also interesting 
to note the means of the three sets of part grades. These were, for 
mathematics, 83.6, mechanical drawing 89.1, and shop 84.0. The mathe- 
matics part, with the lowest assigned weight and lowest mean score, 
had the highest actual weight in determining total score because of its 
very high standard deviation. In shop, as the standard deviation shows, 
the vast majority of men received grades in the 80 to 88 range so that 
these grades had little influence on the total, even though they were 
weighted nominally four times as much as the mathematics score. Here 
the problem was to secure a greater spread of grades in shop work. 
Since the grades of different instructors for the students’ shop work 
correlated from —.11 to .55, it was clearly desirable to secure more 
uniform grading methods rather than simply to multiply such apparently 
inaccurate ratings by a factor such as ten or twenty in order to have 
them exert a predominant influence on final grades. Various gages were 
devised to measure the products of shop work quickly and accurately. 
Such increased accuracy in grading the shop work increased the varia- 
bility of shop grades, and thus these grades legitimately contributed 
more to the total score of the students than the classroom work in 
mathematics. 

It is hoped that this illustration will demonstrate that both the corre- 
lation and the standard deviation, as well as the nominal weight, must 
be taken into consideration when making combinations of scores. It 
also shows that it may not be possible to reach an adequate solution of 
the problem simply by altering the weights of the different part scores. 
It may be necessary to devise new and better tests for certain aspects 
of the work before it is possible to give these aspects their desired weight 
in the total score. 

Equation 71 shows that the correlation between a composite 
(C) and any one of its parts (g) is completely determined by 
the weights, standard deviations, and intercorrelations of the 
subtests. The test mean and the number of items (or perfect 
score) have no effect on the correlation between part and total, 
unless they influence the standard deviations and intercorrelations. 


9. Use of judgment in weighting tests if no criterion is available 

If several part scores are to be combined to determine a total score 
and no criterion is available, so that multiple correlation methods cannot 
be used, one method is to use judgment regarding the relative magnitude 
of the correlation desired between the total and the different part scores. 
Before such judgments can be meaningfully made, it is necessary to 
have considerable information on the interrelationships of the part 
scores. First we note that the total number of items in each test and 
the mean of each test are irrelevant, provided all the students have taken 
each test. The necessary information includes the standard deviation, 
reliability, and error of measurement for each test and the intercorrela- 
tions between the tests so that we may see the kinds of relationships 
already existing in the part scores. With this information at hand, we 
decide on a judgmental basis which of the parts should be weighted 
equal, which higher, and which lower. Wilks (1938) has proposed that 
one definition of “equal” weights be equal correlations with the com- 
posite score. Higher weight means a higher correlation between 
composite and the part; lower weight, a lower correlation. In terms of 
such a definition, we should then decide the relative magnitude of the 
correlations on a judgmental basis. In general, if a test is long and 
reliable, it probably should have a higher correlation with the final 
composite than a test that is short and unreliable. Furthermore, it 
must be noted that, if certain subtests intercorrelate highly with each 
other, these tests will necessarily have similar correlations with the composite. 
It is possible to decide that they will all correlate high with the 
composite or that they will all correlate low with the composite; but it 
is not possible to decide that one member of the set will correlate high 
and the others low with the composite. Roughly speaking, it is necessary 
that a set of highly intercorrelated subtests all “weight” high, or all 
“weight” low in the sense of correlating high or low with the composite 
score. When these decisions are made, the weights are found by solving 
a set of equations of the form 

(72)    $r_{gC} s_C = \sum_{h=1}^{K} W_h r_{gh} s_h$    $(r_{gg} = 1)$, 

in which $r_{gC}$ are the desired correlations determined by judgment, and 
the W's are the unknown weights. 
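
Equations 72 are linear in the unknown weights, so any routine for simultaneous equations will serve. The Python sketch below is a minimal illustration under stated assumptions: the function names and the three-test data are hypothetical, and because $s_C$ is itself unknown until the weights are found, the judged values are treated as relative magnitudes only, so the weights are determined up to a common factor.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly so")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def weights_for_desired_correlations(desired, sds, corr):
    """Weights W_h satisfying sum_h W_h r_gh s_h = desired_g (equation 72)."""
    k = len(desired)
    coeffs = [[corr[g][h] * sds[h] for h in range(k)] for g in range(k)]
    return solve_linear(coeffs, desired)

def realized_correlations(weights, desired, sds):
    """Part-composite correlations actually produced by the solved weights."""
    var_c = sum(w * s * d for w, s, d in zip(weights, sds, desired))
    s_c = var_c ** 0.5
    return [d / s_c for d in desired]

# Hypothetical battery; the judgment here is "equal" weights in Wilks' sense.
corr = [[1.0, 0.5, 0.3], [0.5, 1.0, 0.4], [0.3, 0.4, 1.0]]
sds = [10.0, 5.0, 8.0]
desired = [1.0, 1.0, 1.0]
w = weights_for_desired_correlations(desired, sds, corr)
r_gc = realized_correlations(w, desired, sds)
```

The solved weights should be inspected before use. A set of judged correlations that conflicts with the intercorrelations (for example, asking two highly correlated tests to correlate very differently with the composite) can be satisfied only by extreme or negative weights.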


If in a problem of weighting tests in a battery no criterion is 
available, but the test intercorrelations and standard deviations 
are available, it is possible to define relative weights in 
terms of the relative magnitude of the correlation of the part with 
the composite, to judge what the relative magnitudes of these 
correlations should be, to enter these in equation 72, and to 
solve for the relative weights. It is ... the desired correlation with the 
composite. 


10. Use of factor analysis methods in determining weights of 
subtests 

As a general method for combining subtests we can see that, if the 
battery of tests is factored and a score set up for each factor, we have 
represented a maximum of material in the tests with the minimum 
number of scores. Such methods are dealt with in textbooks on factor 
analysis, and they are beyond the scope of this book. Usually such 
methods are too laborious and time consuming to be adopted for ordinary 
weighting problems. 

It should also be noted that this recommendation results in several 
different scores for each person. If a single grade is to be given or a 
single pass-fail line is to be drawn, the problem of how to combine these 
several factor scores into such a single total score still remains. It is of 
course possible to use judgment in assigning relative weights to the 
different factor scores. However, if judgment is to be used at this 
stage in determining the nature of the composite, it may be almost as 
good to use judgment directly on the subtest scores and avoid the work 
of the factor analysis. In a complex set of tests, however, it is likely 
that the groupings revealed by the factor analysis will make our judg- 
ments simpler and more meaningful. 

In determining a single score for a set of tests, it has been suggested 
that a particular factor score be used as the single score to represent the 
battery. There are several possibilities for selecting this factor score. 

The first principal axis has been suggested by Wilks (1938), Horst 
(1936a), and by Edgerton and Kolbe (1936). Computationally this 
method is quite laborious. It requires a successive approximations pro- 
cedure with even as few as four or five variables. It must also be noted 
that this method is directly sensitive to arbitrary changes in score 
variance. For example, if scores on one test are multiplied by ten, the 
principal axis will swing in the direction of that test. Furthermore, as 
previously indicated, it is of no help to adopt a device such as standard 
scores. Such a procedure would give great weight to short unreliable 
tests in the composite and relatively little weight to a test that was very 
long and accurate. However, provided we are able to fix arbitrarily 
on the appropriate units for each of the subtests, this method has some 
interesting properties. 

Horst derived this method by determining the set of weights that 
would maximize the variance of the composite scores (given a fixed 
value for the sum of squares of the weights). Edgerton and Kolbe 
derived the same method by determining the set of weights that would 
minimize the variance of the set of scores assigned to a given individual. 
Wilks derived the same method in seeking to minimize the “generalized 
variance” for all individuals receiving the same score. The fact that 
these three different approaches resulted in the same method is interest- 
ing. It shows that, provided we fix the units for each test, the largest 
principal axis score maximizes the variance of the composite score, 
minimizes the variance of scores for a given individual, and minimizes 
the generalized variance for all persons receiving the same score. 

Use of the first centroid axis was suggested by Horst (1936a) as an 
approximation procedure. He derived the principal axis solution as 
indicated above but, recognizing its laboriousness, suggested that the 
first centroid axis be used instead as an approximation to the principal 
axis. Using the first centroid axis means making the weights for test g 
proportional to $\sum_{h=1}^{K} r_{gh}$. This term is easily obtained. There is, however, 
the usual problem in factor analysis regarding the selection of an appropriate 
value for the term $r_{gg}$. In some solutions this term is given the 
value unity; in other solutions it is set equal to zero, equal to the reliability 
coefficient, or equal to some estimate of the communality of the 
test. From the initial discussion of the effects of weighting, we see that, 
if the correlations are few and small, the decision on the appropriate value 
for $r_{gg}$ will make a great difference in the composite score. If the correlations 
are many and large, the resulting composite score will be only 
negligibly affected by this decision. 

Horst, in recommending the use of the first centroid axis, felt that it 
was an approximation to the longest principal axis. Edgerton and Kolbe, 
however, gave an illustration in which the two were quite different. 
The amount of difference between these two solutions will depend on 
the nature of interrelationships in the battery. 

Using a single common factor as a guide to the weighting of tests in a 
battery was suggested by Spearman (1927, Appendix, page xix). This 
method would apply only to a battery of tests that satisfied Spearman’s 
original two-factor theory, which means that one factor is common to 
all the tests in the battery, and in addition each test has its own specific 
factor which is uncorrelated with the general factor and with the specific 
factor in each of the other tests. The correlation between the factor 
specific to x and that specific to y may be regarded as the correlation 
between x and y with the general factor partialled out. Since the 
specifics correlate zero, we have from the formula for partial correlation 

        $0 = r_{xy} - r_{gx} r_{gy}$ 
or 
        $r_{xy} = r_{gx} r_{gy}$. 

That is, if the battery is explained entirely by one common factor, the 
correlation between any two tests is the product of the correlation of 
each of those tests with the general factor. By methods formally 
identical with those of section 4 on weighting by reliabilities, we can 
show that the multiple correlation weights to use in predicting the 
general factor from the tests in a one-factor battery are proportional to 

(73)    $\dfrac{r_{gx}}{1 - r_{gx}^2}$ 

for any test (x), where $r_{gx}$ is the factor loading of test x or its correlation 
with the one common factor. The methods of determining whether or 
not a given set of tests is adequately represented by one common factor 
and the methods of determining the correlation with this factor are 
discussed in the various textbooks on factor analysis; see Spearman 
(1927), Thurstone (1935a), (1947b), and others. 
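
The one-factor weights are easily computed from the loadings. The Python sketch below assumes tests in standard-score form and loadings numerically less than unity; the function names and the three loadings are hypothetical. It forms weights proportional to the loading divided by one minus its square, then scales them so that they satisfy the usual normal equations for predicting the general factor when the intercorrelations are the products of the loadings.

```python
def one_factor_weights(loadings):
    """Multiple-regression weights for predicting the common factor."""
    # Proportional to r_gx / (1 - r_gx^2) for each test x.
    raw = [a / (1.0 - a * a) for a in loadings]
    # Constant that makes the weights satisfy the normal equations exactly.
    scale = 1.0 + sum(a * r for a, r in zip(loadings, raw))
    return [r / scale for r in raw]

def implied_correlations(loadings):
    """One-factor battery: r_xy = r_gx * r_gy off the diagonal, unity on it."""
    k = len(loadings)
    return [[1.0 if x == y else loadings[x] * loadings[y] for y in range(k)]
            for x in range(k)]

# Hypothetical loadings on the single common factor.
loadings = [0.9, 0.6, 0.3]
beta = one_factor_weights(loadings)
r_matrix = implied_correlations(loadings)
```

A test with a loading of .9 receives far more than one and one-half times the weight of a test with a loading of .6, since the weight rises steeply as the loading approaches unity.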

These various solutions based essentially on factor theory are interest- 
ing. The principal axis solution is especially interesting since it both 
maximizes interindividual differences and minimizes intraindividual 
differences. It is, however, markedly influenced by arbitrary decisions 
made on test scoring. Similar remarks apply to the centroid solution. 
Where only one factor is necessary to account for the correlations, this 
system will give a unique solution. It should be noted that, from one 
point of view, only tests that form a one-factor system should be combined 
into a single score. Where several factors are present, several 
scores should result. 

The results of any one-factor solution to the weighting problem should 
still be inspected to determine the correlation between the composite 
and the individual tests to be certain that, from a judgmental point of 
view, there is nothing obviously peculiar or undesirable about the solution. 

If no criterion is available and the battery of tests turns out 
to have only one common factor, the tests may be weighted to 
give the best prediction of this factor. If the battery is a multifactor 
one, it is possible arbitrarily to select the longest principal 
axis or the first centroid axis as the best one to represent 
the entire battery. 


11. Weighting to equalize marginal contribution to total 
variance 

Wilks (1938) has suggested another method of defining and determining 
“equal” weights. He points out that the variance of the weighted 
sum of $K - 1$ tests will be less than that of the K tests, and suggests 
that in one sense all the tests are equally weighted if the variance of 
any combination of $K - 1$ tests is equal. That is, the variance of the 
total test would be equally affected no matter which one of the K 
constituent tests was removed from the composite. This method again 
is computationally complex and seems also to have little in its favor. 
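
The criterion itself, as distinct from the labor of finding weights that satisfy it, is simple to state in computational form. The Python sketch below checks whether a proposed set of weights leaves the composite variance equally affected by the removal of any one test. The function names and the data are hypothetical, and the sketch only evaluates the condition; it does not search for the weights.

```python
def composite_variance(weights, sds, corr, omit=None):
    """Variance of the weighted sum, optionally leaving one test out."""
    kept = [g for g in range(len(weights)) if g != omit]
    return sum(weights[g] * weights[h] * corr[g][h] * sds[g] * sds[h]
               for g in kept for h in kept)

def leave_one_out_variances(weights, sds, corr):
    """Variance of each combination of K - 1 tests."""
    return [composite_variance(weights, sds, corr, omit=g)
            for g in range(len(weights))]

# Hypothetical battery of three uncorrelated tests.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
sds = [2.0, 3.0, 6.0]
unit = leave_one_out_variances([1.0, 1.0, 1.0], sds, identity)
balanced = leave_one_out_variances([3.0, 2.0, 1.0], sds, identity)
print(unit)      # [45.0, 40.0, 13.0]
print(balanced)  # [72.0, 72.0, 72.0]
```

With uncorrelated tests the condition reduces to making the product of weight and standard deviation the same for every test. With correlated tests no such shortcut is available, which is the source of the computational complexity noted above.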


12. Weighting to maximize the reliability of the composite 


If no external criterion is available, we may wish to assign weights 
so that the reliability of the weighted composite will be a maximum. 
The solution to this problem has been given for a special case by Mosier 
(1943), and for the general case by Thomson (1940) and Peel (1948). 
The solution given by Thomson can be shown to be equivalent to that 
given by Peel, and since that given by Peel is much simpler we shall 
use it. 


Let the matrix of intercorrelations for the test battery be designated 

$$R = \begin{bmatrix} 1 & r_{12} & r_{13} & \cdots & r_{1K} \\ r_{12} & 1 & r_{23} & \cdots & r_{2K} \\ r_{13} & r_{23} & 1 & \cdots & r_{3K} \\ \vdots & \vdots & \vdots & & \vdots \\ r_{1K} & r_{2K} & r_{3K} & \cdots & 1 \end{bmatrix}.$$

Let the matrix of intercorrelations between the tests of the two parallel 
batteries be designated 

$$C = \begin{bmatrix} r_{11} & r_{12} & r_{13} & \cdots & r_{1K} \\ r_{12} & r_{22} & r_{23} & \cdots & r_{2K} \\ r_{13} & r_{23} & r_{33} & \cdots & r_{3K} \\ \vdots & \vdots & \vdots & & \vdots \\ r_{1K} & r_{2K} & r_{3K} & \cdots & r_{KK} \end{bmatrix}.$$

The off-diagonal entries of C are identical with the corresponding 
entries of R. The two matrices differ only in that one has reliability 
coefficients, and the other has unity in the diagonals. 

Let us also define the row vector of weights, 

        $W = [\,W_1 \;\; W_2 \;\; W_3 \;\; \cdots \;\; W_K\,]$. 

Since the second battery is assumed to be parallel to the first, the 
matrix of intercorrelations (R) for this battery is identical 
with that for the first battery. Equation 7, the general expression for 
the correlation of two weighted sums, is given in matrix notation by 

        $R_{x_W x_W} = \dfrac{W C W'}{\sqrt{W R W'}\,\sqrt{W R W'}}$. 

Since we are dealing with a reliability coefficient, the two batteries of 
tests and two sets of weights are identical. Thus the two factors in the 
denominator are identical, giving 

(74)    $R_{x_W x_W} = \dfrac{W C W'}{W R W'}$. 

Since the reliability of the composite will remain the same if all the 
weights are multiplied or divided by any arbitrary factor, another condition 
is needed to determine the weights. We may say that the weights 
shall be chosen to make the variance of the composite unity, that is, 

(75)    $W R W' = 1$. 

In order to select weights that maximize equation 74 subject to the 
condition given in equation 75, we define a new function, using the 
Lagrange multiplier ($\lambda$) as follows: 

(76)    $F = W C W' - \lambda (W R W' - 1)$. 

Differentiating equation 76 with respect to each of the W's in turn, 
setting each derivative equal to zero, and dividing by 2 gives the set of 
equations 

(77)    $W C - \lambda W R = 0$. 

Postmultiplying both matrices by $W'$ and solving for $\lambda$ gives 

(78)    $\lambda = \dfrac{W C W'}{W R W'} = R_{x_W x_W}$. 

Since $\lambda$ is a scalar, it may occupy any position in a product. Thus the 
solution of equation 77 for W gives 

(79)    $W (C - \lambda R) = 0$. 

Equations 79 have a solution other than W = 0 only if the determinant 
of the coefficients of W equals zero; thus 

(80)    $|C - \lambda R| = 0$. 

Equation 80 is a Kth degree equation in $\lambda$. Since, from equation 78, 
$\lambda = R_{x_W x_W}$, and we are seeking the maximum reliability, we choose $\lambda$ as 
the largest root of equation 80, substitute this value in equation 79, 
and solve for the relative weights. Using the condition given by equation 
75 completes the solution for the weights that will maximize the 
reliability of the weighted composite. 
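
For a battery of only two tests, equation 80 is a quadratic and the whole procedure can be carried through by hand or in a few lines of code. The Python sketch below is an illustration for that special case only; the function names and the reliabilities used are hypothetical, and the intercorrelation is assumed to be numerically less than unity.

```python
import math

def composite_reliability(w, c, r):
    """Equation 74: W C W' divided by W R W'."""
    k = len(w)
    num = sum(w[g] * c[g][h] * w[h] for g in range(k) for h in range(k))
    den = sum(w[g] * r[g][h] * w[h] for g in range(k) for h in range(k))
    return num / den

def max_reliability_two_tests(rel1, rel2, r12):
    """Largest root of |C - lambda R| = 0 for K = 2, and the matching weights."""
    # Expanding the 2 x 2 determinant gives a*lam^2 - b*lam + c0 = 0.
    a = 1.0 - r12 * r12
    b = rel1 + rel2 - 2.0 * r12 * r12
    c0 = rel1 * rel2 - r12 * r12
    lam = (b + math.sqrt(b * b - 4.0 * a * c0)) / (2.0 * a)
    # The first row of (C - lam R) W' = 0 fixes the ratio of the weights.
    w = [r12 * (1.0 - lam), lam - rel1]
    if abs(w[0]) + abs(w[1]) < 1e-12:   # first row vanished; use the second
        w = [lam - rel2, r12 * (1.0 - lam)]
    if abs(w[0]) + abs(w[1]) < 1e-12:   # every weighting is equally reliable
        w = [1.0, 1.0]
    if w[0] + w[1] < 0.0:               # the sign is arbitrary; keep it positive
        w = [-w[0], -w[1]]
    # Equation 75: scale the weights so that W R W' = 1.
    scale = math.sqrt(w[0] ** 2 + w[1] ** 2 + 2.0 * r12 * w[0] * w[1])
    return lam, [w[0] / scale, w[1] / scale]

# Hypothetical tests: reliabilities .90 and .60, intercorrelation .50.
lam, weights = max_reliability_two_tests(0.9, 0.6, 0.5)
c_matrix = [[0.9, 0.5], [0.5, 0.6]]
r_matrix = [[1.0, 0.5], [0.5, 1.0]]
print(round(lam, 3))  # 0.907
```

In this example the maximum reliability is only slightly above that of the more reliable test alone, and the less reliable test receives a small weight. For larger batteries the same steps apply, but the roots of equation 80 must be found numerically.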

A simplified formula that gives a principal axis solution has been given 
by Green (1950a). 


13. The most predictable criterion 

By methods analogous to those of the preceding section, weights may be 
found for two batteries of tests such that the correlation between the two 
weighted composites is a maximum. In the notation used here, 

R is the matrix of intercorrelations of tests in the first battery, 
S is the matrix of intercorrelations of those of the second, 
C is the matrix of correlations between the tests of the first battery and those of the second, 
V is the row vector of weights (V) to be used for the first battery, and 
W is the row vector of weights (W) to be used for the second battery. 

Writing equation 7 for the correlation of two weighted sums in matrix 
notation, we have 

(81)    $R_{x_V x_W} = \dfrac{V C W'}{\sqrt{V R V'}\,\sqrt{W S W'}}$. 

In order to avoid multiple solutions, some other restriction on the 
weights is necessary, such as adjusting the weights to make the variance 
of each composite unity. This corresponds to the restriction that 

(82)    $V R V' = W S W' = 1$. 

Thus we may define a new function using two Lagrange multipliers, 
$\lambda$ and $\gamma$, as 

(83)    $F = V C W' - \dfrac{\lambda}{2}(V R V' - 1) - \dfrac{\gamma}{2}(W S W' - 1)$. 

Differentiating with respect to the V's and the W's and setting the 
derivatives equal to zero, we have 

(84)    $C W' - \lambda R V' = 0$ 
and 
(85)    $V C - \gamma W S = 0$. 

Premultiplying the first equation by V and solving for $\lambda$, and postmultiplying 
the second equation by $W'$ and solving for $\gamma$, we find that 

(86)    $\gamma = \lambda = \dfrac{V C W'}{V R V'} = \dfrac{V C W'}{W S W'} = R_{x_V x_W}$. 

Postmultiplying both terms of equation 85 by $S^{-1}$ and solving for W, 
we obtain 

(87)    $W = \dfrac{1}{\gamma} V C S^{-1}$. 

Substituting equation 87 in the transpose of equation 84, factoring out 
V, and writing $\lambda^2$ for the product $\gamma\lambda$, we have 

(88)    $V (C S^{-1} C' - \lambda^2 R) = 0$. 

By a corresponding procedure we find the solution for W as 

(89)    $W (C' R^{-1} C - \lambda^2 S) = 0$. 

Equations 88 and 89 have a solution for the weights other than zero 
only if the determinant of the coefficients of V and of W equals zero. 
We may write 

(90)    $|C S^{-1} C' - \lambda^2 R| = 0$,    $|C' R^{-1} C - \lambda^2 S| = 0$. 

Equations 90 are polynomials in $\lambda^2$. Since equation 86 shows 
$\lambda^2 = R^2_{x_V x_W}$ and we are seeking the maximum correlation between 
the weighted composites, we choose the largest root of equation 90, and 
use this value in equations 88 and 89 and the conditions given by 
equation 82. This procedure completes the solution for the weights 
that will maximize the correlation between the two weighted composites. 

Setting $C = C'$, $R = S$, and $W = V$ in equations 88 and 89 gives 
the solution for maximizing battery reliability given by Thomson (1940). 
Such a solution of the problem of maximizing battery reliability is 
equivalent to the much simpler solution of equation 79. The following 
procedure for demonstrating this equivalence was suggested by Dr. 
L. R. Tucker of the Educational Testing Service. 

Postmultiplying both terms of equation 77 by $R^{-1} C$ gives 

        $W C R^{-1} C = \lambda W C$. 

Multiplying both sides of equation 77 by $\lambda$ gives 

        $\lambda W C = \lambda^2 W R$. 

Thus we have the solution derived from simplifying equations 88 and 89, 

        $W (C R^{-1} C - \lambda^2 R) = 0$, 

where $\lambda^2$ equals $R^2_{x_W x_W}$. Since the two solutions are equivalent, the 
simpler one indicated in equation 79 is to be preferred. 

The weights given by equations 88 and 89 constitute a mathematically 
very elegant method of weighting to secure a maximum correlation. 
However, it should be noted that, unless used with discretion, the 
procedure of determining the most predictable criterion has certain 
dangers. For example, if one of the criterion measures and one of the 
predictor measures happen to be tests of the same factor, both batteries 
are likely to be weighted to correspond with that factor. That is, if 
one test of spatial visualization is used in the prediction battery, and 
another test of spatial visualization in the criterion, while verbal and 
quantitative factors are represented in only one of the batteries, the 
most predictable criterion procedure is likely to result in warping the 
criterion to represent primarily spatial visualization. The blind acceptance 
of such a result would mean that there would be no effort to represent 
the other factors, such as, for example, verbal and quantitative in 
the predicting battery. In an extreme case such a procedure could mean 
that all factors in the criterion that were initially omitted from the 
prediction battery would always be omitted since they would receive 
very little weight in the criterion. 
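
The warping described above can be seen by evaluating equation 81 for a small hypothetical case. In the Python sketch below, each battery contains one spatial test and one other test; the correlation matrices are invented for illustration, and the function names are assumptions. The sketch only compares two sets of weights and does not carry out the full maximizing solution.

```python
import math

def quad(u, m, v):
    """Bilinear form u M v' for row vectors u, v and matrix m."""
    return sum(u[i] * m[i][j] * v[j]
               for i in range(len(u)) for j in range(len(v)))

def composite_correlation(v, w, r, s, c):
    """Equation 81: correlation between the V-weighted and W-weighted composites."""
    return quad(v, c, w) / math.sqrt(quad(v, r, v) * quad(w, s, w))

# Hypothetical data. Predictors: spatial test A, verbal test.
# Criteria: spatial test B, quantitative measure.
r = [[1.0, 0.2], [0.2, 1.0]]   # intercorrelations within the predictor battery
s = [[1.0, 0.2], [0.2, 1.0]]   # intercorrelations within the criterion battery
c = [[0.7, 0.1], [0.1, 0.1]]   # predictor (rows) by criterion (columns)

equal = composite_correlation([1.0, 1.0], [1.0, 1.0], r, s, c)
spatial_only = composite_correlation([1.0, 0.0], [1.0, 0.0], r, s, c)
print(round(equal, 3), round(spatial_only, 3))  # 0.417 0.7
```

Weights concentrated on the two spatial tests give a much higher correlation than equal weights, so a procedure that seeks only the maximum will move toward them and leave the quantitative part of the criterion with little weight.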

The remedy here, as in all other uses of mathematical procedures, 
is to inspect the results to see if they have any peculiar characteristics. 
Any set of tests that have low weight for the criterion should be inspected 
to see if they would be regarded by experts in the field as being important 
and deserving of an important place in determining the total criterion. 
The prediction battery should be inspected to see if an attempt has 
been made to include the type of ability required by the criterion variables 
that received low weight in determining the composite. In general 
we should alter the variables entering into the two batteries if the results 
do not seem to be appropriate. 

Another way of stating the caution given in the foregoing paragraphs 
is to say that a mathematical method should be adopted only to choose 
between alternatives that are judgmentally very similar. For example, 
if all the criterion variables are indifferent so that the expert would 
accept any set of weights positive or negative, the most predictable 
criterion results need not be questioned. However, if the expert judgment 
is that any set of positive weights is acceptable, it would seem 
proper to use the most predictable criterion only if all weights were 
positive. The expert judgment might be even more restrictive. For 
example, if the criterion measures were grades and tests in college, and 
if these measures appeared to involve three types of abilities—verbal, 
quantitative, and spatial—the faculty might judge that any composite 
was acceptable as long as the verbal and quantitative factors had weights 
distinctly higher than those of the spatial factor. Given this judgment, 
we could use the most predictable criterion if the weights fell in the area 
indicated. Otherwise the problem for the technician would be to alter 
the criterion or predictor variables in some reasonable and acceptable 
fashion so that the weights would have reasonably appropriate relative 
values. In general, we may say that mathematical procedures are 
appropriately used when they serve to guide thought. If an attempt 
is made to utilize such routines as a substitute for thought, we may 
unwittingly arrive at and accept absurd conclusions. 


14. Differential tests 
Sometimes when a battery of tests is given, the problem is to obtain 
a single score from the battery. This implies that the battery is a one- 
factor battery or is to be treated as a one-factor battery. Sometimes 
it is desired to obtain several different scores from the battery. This 
implies that the battery represents several different factors, and a score 
is to be obtained for each of the factors. In this latter case the best 
procedure is to determine the factors present in the battery, and then 
to use the scores that best predict the factor scores of each individual. 
It may happen when this procedure is followed that the factor scores 
finally obtained still intercorrelate rather high so that, instead of having 
a set of differential scores, we have scores that in large part give different 
patterns of ability only through incidental errors of measurement. 
Whenever a set of supposedly differential scores are set up by factor 
analysis or other methods, it is desirable to make a check on the scores 
finally proposed to determine the extent to which such scores will give 
valid differentiation of different scores for the same individual. 
When the accomplishment quotient (A.Q.) was introduced, Kelley 
pointed out that the problem involved was to obtain reliable measures 
of each variable. Clearly these measures would be correlated, so that 
the accomplishment quotient might reflect only errors of measurement. 
Kelley (1923a) proposed a method for testing the extent to which two 
tests are giving differential scores to a set of persons. This method makes 
use of both test reliability and intercorrelation to determine the per- 
centage of scores that will show a reasonable difference and the 
percentage that will show such a difference solely through errors of 
measurement. The first percentage should be considerably larger than 
the second if the scores are to be used for their differential value. 
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If we use x and y to designate deviation scores in the two tests under 
consideration and the subscript ¢ to designate the true scores in these 
variables, the difference between the observed score difference and the 
true score difference constitutes the error with which a difference in 
scores is measured. 

(91) ea = (x — y) = (u — y). 

By rearranging terms, the error of the difference may be written in 
terms of the error in x and in y, as follows: 

(92) ea = (2—2) — (Y — Y) = ez — ey. 


Since the errors in x are independent of the errors in y, the sum of 
squares may be written 


(93) Deg? = De,” + Xe. 
uy 


Expressing the error in x and y in terms of the test reliability and 
standard deviation, we have 


(94) $e = Zea? /N ES sz = Tz) + sy (l — Ty). 


The magnitude of this term which is the variability of difference due to 
error may be compared with the total variability of differences. 


Z(r— y)? Ze? By? Way 


95 Lay = 
(n d N N N N 


From the equations for variance and correlation, we have 
(96) Pay = 82° + 8,7? — WrySz8y. 


If the tests x and y are expressed in standard score, the standard devia- 
tions and variances become unity, and equations 94 and 96 become 
respectively, i 


(97) Sea” = 2 — Taz — ty = AL — F) 
and 
(98) Pey = 2(1 — £4), 


where 7 is the average reliability of the two tests. 
As the average reliability becomes markedly larger than the
intercorrelation of the tests, the dispersion of obtained differences becomes
greater than that obtained by chance.
Kelley (1923a) proposed using normal curve proportions derived
from equations 97 and 98 to find the percentage of observed differences
in excess of that which could be expected to occur by chance because of
the error of measurement.

To put this material in more familiar form, equation 21, Chapter 17, 
shows us the reliability expressed as a function of the error variance 
and total variance. Equations 97 and 98 in this chapter and 21 of 
Chapter 17 give 


(99)    $r_{x-y} = 1 - \frac{2(1 - \bar{r})}{2(1 - r_{xy})}.$

Simplifying, we have

(100)    $r_{x-y} = \frac{\bar{r} - r_{xy}}{1 - r_{xy}},$


where $r_{x-y}$ is the reliability of the difference between x and y,
$r_{xy}$ is the correlation between tests x and y, and
$\bar{r}$ is one-half the sum of the reliabilities of tests x and y.
A similar equation is given by Conrad (1944b), page 7.
For any pair of tests, equation 100 gives the reliability of the
difference as a function of the intercorrelation of the two tests
($r_{xy}$) and the average of the two reliability coefficients ($\bar{r}$).
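
As a rough numerical check on equations 94, 96, and 100, the following Python sketch computes the reliability of a difference score as one minus the ratio of the error variance to the total variance of the difference, and compares it with the standard-score formula. The function names and the example values are illustrative assumptions and do not come from the text.

```python
def error_variance_of_difference(s_x, s_y, r_xx, r_yy):
    """Equation 94: variance of the difference due to errors of measurement."""
    return s_x**2 * (1 - r_xx) + s_y**2 * (1 - r_yy)


def variance_of_difference(s_x, s_y, r_xy):
    """Equation 96: total variance of the observed differences x - y."""
    return s_x**2 + s_y**2 - 2 * r_xy * s_x * s_y


def difference_reliability(s_x, s_y, r_xx, r_yy, r_xy):
    """Reliability as one minus error variance over total variance."""
    err = error_variance_of_difference(s_x, s_y, r_xx, r_yy)
    tot = variance_of_difference(s_x, s_y, r_xy)
    return 1 - err / tot


def difference_reliability_standard(r_xx, r_yy, r_xy):
    """Equation 100: the special case of standard scores."""
    r_bar = (r_xx + r_yy) / 2
    return (r_bar - r_xy) / (1 - r_xy)


# Hypothetical values: reliabilities .70 and .50, intercorrelation .50.
general = difference_reliability(1.0, 1.0, 0.70, 0.50, 0.50)
special = difference_reliability_standard(0.70, 0.50, 0.50)
print(round(general, 3), round(special, 3))
```

With unit standard deviations the two functions agree, which is the simplification that leads from equation 99 to equation 100.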


Figure 3 is a linear graph showing the nature of this relationship.
To use this computing diagram, mark the diagonal line corresponding to
the intercorrelation of the two tests ($r_{xy}$). Locate the average reliability
at the bottom of the chart, then move up to the diagonal line and over
to the scale at the right showing the reliability of the difference. In the
illustrative problem (shown by the heavy dashed line in Figure 3), if
the average reliability is .6 and the intercorrelation is .5, the reliability
of the difference is only .2. It should be noted that, if the average
reliability of the two tests is about the same as their intercorrelation,
the reliability of the difference is approximately zero. In order for the
reliability of the difference to be .8 when $r_{xy}$ is .5, the average reliability
of the tests must be .9.
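
The last statement can be obtained by solving equation 100 for the average reliability, which gives $\bar{r} = r_{x-y}(1 - r_{xy}) + r_{xy}$. The short Python sketch below tabulates this for a target difference reliability of .8; the function name and the choice of intercorrelations are assumptions made for illustration.

```python
def required_average_reliability(target, r_xy):
    """Solve equation 100 for the average reliability r_bar."""
    if r_xy >= 1:
        raise ValueError("r_xy must be less than 1")
    return target * (1 - r_xy) + r_xy


# Average reliability needed for a difference reliability of .8
# at several hypothetical intercorrelations.
for r_xy in (0.3, 0.5, 0.7):
    needed = required_average_reliability(0.8, r_xy)
    print(f"r_xy = {r_xy:.1f}: average reliability needed = {needed:.2f}")
```

The required average reliability rises with the intercorrelation, which is the same relationship that the diagonal lines of the computing diagram express.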

In setting up a profile type of battery in which differences in a given
person's scores on different tests are important, we must be certain that the
reliability of the difference in scores is fairly high before giving this
difference much weight in the interpretation. Kelley (1923a) suggested
that this reliability figure be interpreted in terms of the percentage of
observed difference scores in excess of that which could be expected to
occur by chance because of the errors of measurement. If this percentage
were very small, the difference score would not be very useful. The
interpretation in terms of reliability is more conventional; also the error
of measurement can be obtained so that differences less than one or
two errors of measurement will not be interpreted as indicating a real
difference in ability in the two tests.

FIGURE 3. Showing the relationship between the average reliability of two tests,
their intercorrelation, and the reliability of the difference between the two tests.

Equation (100)    $r_{x-y} = \frac{\bar{r} - r_{xy}}{1 - r_{xy}}$

Many profile tests furnish no information on the reliability of the
differences in scores. Such information should be required as a routine
part of the validation and standardization of any battery to be used
as a profile.

The Differential Aptitude Battery of the Psychological Corporation
has been set up in this manner; see Bennett (1947) or Bennett and
Doppelt (1948).
Brogden (1946a) has presented a method for determining cutting
scores in connection with the problem of differential prediction. A 
more complete analysis of the problems of differential prediction is 
given by Tucker (1948), Thorndike (1949), and Mollenkopf (1950). 


15. Summary 

Whenever a single total score is to be derived from a number of sepa- 
rate scores, the weighting problem cannot be avoided. However, if 
many different scores with reasonably high intercorrelations are being 
combined, the resulting composite will be fairly similar for a large 
variety of weights. If, however, relatively few items are to be combined, 
there is a low correlation among these items, the standard deviation of 
the distribution of weights being considered is fairly large, and the 
correlation between two sets of weights is low, the two resulting com- 
posites will be different. The correlation between two composites 
obtained by using different weights on the same set of scores is given by 


Equation (15)

$R_{x_V x_W} = \frac{K(1 - \bar{r})(C_{VW} + \bar{V}\bar{W}) + (K^2 - K)C_{(VW)r} + K^2\bar{V}\bar{W}\bar{r}}{\sqrt{\left[K(1 - \bar{r})(s_V^2 + \bar{V}^2) + (K^2 - K)C_{(VV)r} + K^2\bar{V}^2\bar{r}\right]\left[K(1 - \bar{r})(s_W^2 + \bar{W}^2) + (K^2 - K)C_{(WW)r} + K^2\bar{W}^2\bar{r}\right]}},$

where K is the number of scores to be combined,
$\bar{V}$ and $\bar{W}$ are the averages of the V and W weights,
$s_V$ and $s_W$ are the standard deviations of the two sets of weights,
$\bar{r}$ is the average intercorrelation of the scores to be combined,
$C_{VW}$ is the covariance between the two sets of weights used,
$C_{(VW)r}$, $C_{(VV)r}$, and $C_{(WW)r}$ represent the covariance of a product of
weights with the corresponding correlation $r_{gh}$, and
$R_{x_V x_W}$ is the correlation between the composite obtained using the
V-weights and that obtained using the W-weights.


Equation 15 shows us that, if $\bar{r}$, $\bar{V}$, and $\bar{W}$ are positive, the correlation
$R_{x_V x_W}$ approaches unity: (a) as the correlation between the two sets
of weights is increased or (b) as the standard deviation of the weights
($s_V$, $s_W$) is decreased in proportion to the mean weights ($\bar{V}$ and $\bar{W}$).
Also, if the covariance terms of the form $C_{(VW)r}$ may be ignored,
$R_{x_V x_W}$ approaches unity: (c) as K approaches infinity or (d) as $\bar{r}$
approaches unity.
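
These tendencies can be checked numerically without equation 15 itself, by computing the correlation between two weighted composites of standard scores directly from the intercorrelation matrix. The Python sketch below does this for a hypothetical battery in which every pair of tests has the same correlation; the function names, the weights, and the equal-correlation assumption are illustrative only.

```python
import math


def composite_correlation(v, w, r):
    """Correlation between sum(v_g z_g) and sum(w_g z_g) for standard scores.

    r is the K x K intercorrelation matrix with unity in the diagonal.
    """
    k = len(v)
    cov_vw = sum(v[g] * w[h] * r[g][h] for g in range(k) for h in range(k))
    var_v = sum(v[g] * v[h] * r[g][h] for g in range(k) for h in range(k))
    var_w = sum(w[g] * w[h] * r[g][h] for g in range(k) for h in range(k))
    return cov_vw / math.sqrt(var_v * var_w)


def equicorrelated(k, r_bar):
    """Hypothetical battery in which every pair of tests correlates r_bar."""
    return [[1.0 if g == h else r_bar for h in range(k)] for g in range(k)]


v = [1, 2, 3, 4]   # illustrative weights
w = [4, 3, 2, 1]   # the same weights in reverse order

for r_bar in (0.2, 0.5, 0.8):
    r = equicorrelated(4, r_bar)
    print(r_bar, round(composite_correlation(v, w, r), 3))
```

Even with the two sets of weights in opposite order, the correlation between the composites moves toward unity as the average intercorrelation increases, in line with point (d).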

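Some appraisal of the magnitude of $R_{x_V x_W}$ for positive weights may
be obtained from


(47)    $\bar{R} = 1 - \frac{1}{2K\bar{r}}\left[\left(\frac{s_V}{\bar{V}}\right)^2 + \left(\frac{s_W}{\bar{W}}\right)^2\right],$


where $\bar{R}$ is an approximation to the mean value of the correlation
between two weighted composites, $s_V/\bar{V}$ and $s_W/\bar{W}$ are the ratios of the
standard deviation to the mean for each set of weights, and the other terms
have the definitions indicated for equation 15.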

Equations 15 and 47 show us that, if a few scores with low intercorrela- 
tions are involved, and if also we are considering alternative sets of 
weights with large variance and low intercorrelation, weighting will 
make an appreciable difference, and the selection of a “best” set of 
weights is important. For such a case we may say: 

1. If a criterion is available, the multiple correlation weights indicated 
by equations 52 and 53 for the general case and by equation 56 for the 
three-variable case will give the best results, in the sense that the 
correlation between the weighted composite and the criterion will be a 
maximum (see equation 55 for the general case and 58 for the three- 
variable case). For practical purposes, simple integral approximations 
to the exact multiple weights will usually give a satisfactory composite 
score. 

2. If many variables are involved, and particularly if a selection is to 
be made among these variables, some approximation to multiple correla- 
tion as indicated in section 3 is to be preferred to the exact method. 

3. Where no criterion is available, various weighting methods have 
been adopted: 


(a) Weighting in terms of the average score or the perfect score, which
is usually equal to the number of items in an objective test, is
always to be avoided. There is no justification for the belief that
these factors have or should have any effect on the weight of an
item in a composite.

(b) Weighting inversely as the standard deviation ($1/s_x$) or inversely
as the error variance, that is, by the factor $r/(1 - r)$, has been
suggested in the literature on testing. Both these methods depend
on a rationale involving assumptions that are probably never
satisfied in practice. Also there are many situations in which
either method will give results clearly inappropriate.

(c) Weighting to equalize "marginal" contributions to total variance
has been suggested (see section 11). This method also has peculiar
properties.

(d) Weighting inversely as the error of measurement was discussed
in section 6. No rationale is at present available for this method,
and it seems not to have been used or suggested before. From a
common-sense point of view, this method has a valuable set of
properties; and, of the different rule-of-thumb methods presented
in this chapter, it is the one that would seem to be most generally
acceptable.


4. Where no criterion is available, it is probably best not to use any 
rule-of-thumb method. Two alternatives are suggested here. 


(a) We may weight the items so as to maximize the reliability of the 
composite by using the matrix formula 


(79)    $W(C - \lambda R) = 0,$


where R is the matrix of intercorrelations among the variables (with
unity in the diagonals),
C is the same matrix with the substitution of reliabilities for unity
in the diagonals,
W is the vector of weights, and
$\lambda$ is chosen as the largest root of


(80)    $|C - \lambda R| = 0.$


(b) We may depend on expert judgment for the determination of 
weights. For a system of correlated variables there is no satis- 
factory method of assessing the proportional contribution of each 
component to the total. The best guide is the correlation of each 
part with the composite. The correlation of any part (g) with the
composite (C) is given by


(71)    $r_{gC} = \frac{W_g s_g + \sum_{h=1}^{K} W_h r_{gh} s_h}{s_C} \quad (g \neq h),$


where $r_{gC}$ is the correlation between part g and the weighted
composite (C),
$W_g$ (or $W_h$) is the weight assigned to any part (g or h),
$s_g$ (or $s_h$) is the standard deviation of that part,
$r_{gh}$ is the correlation between two parts, and
$s_C$ is the standard deviation of the composite.
In equation 71 the correlation between part g and the composite is
shown to depend entirely upon (a) the weight of that part ($W_g$), (b) the
standard deviation of that part ($s_g$), and (c) the sum of products
$W_h r_{gh} s_h$ for all other parts entering into the composite.

In judging what weights are to be assigned to the different parts, we
must understand clearly that these are the only factors determining the
"weight" of the part in the composite, and that the correlation between
the part and the composite is the best criterion to use in judging the
effective weights of each part in the composite. If the judges will assign
certain relative values to the correlations $r_{gC}$, the solution of equation 72
for $W_g$ will give the relative weights for the different parts.

5. If multiple criterion and predictor measures are available, we may
select weights so that the correlation between the two composites will
be maximized. These weights are given by the matrix equations,


(88)    $V(CS^{-1}C' - \lambda R) = 0$
and
(89)    $W(C'R^{-1}C - \lambda S) = 0,$


where R and S are the matrices of intercorrelations among the variables
of each set,
C is the matrix of correlations of variables in one set with
those in the other set,
V is the set of weights to be applied to the variables of the
R-matrix,
W is the set of weights to be applied to the variables of the
S-matrix, and
$\lambda$ is chosen as the largest root of

(90)    $|CS^{-1}C' - \lambda R| = 0$ or $|C'R^{-1}C - \lambda S| = 0.$

The weights given by this system are so flexible that we may easily
be led to a composite criterion that is undesirable from a judgmental
point of view. When using "the most predictable criterion" it is
necessary to inspect the weights carefully from the point of view of the
expert judge in order to avoid accepting an unreasonable criterion.

6. It has been suggested that the methods of factor analysis may aid
in solving the weighting problem. Where a single score is to be
determined, it has been suggested:


(a) That the first principal axis be used as the best representative
of the set of scores. This method has a number of very interesting
properties. It maximizes the interindividual differences,
minimizes the differences between the various scores obtained by
a given person, and minimizes the "generalized variance" for all
individuals receiving the same score. Despite such a set of
properties, however, it has two serious disadvantages. It is
laborious to compute, since it necessitates a successive
approximations procedure, and it is sensitive to the units in which the
various tests are measured.

(b) That the first centroid axis be used. It may be a good possibility
if some one score is desired to represent a set of scores.

(c) That, if the set of scores is actually a one-factor system, the one 
common factor would seem to be a very good choice for the 
composite score. Spearman suggested this solution, and gave the 
equations for it. However, a set of tests must be very carefully 
selected if it is to contain only one factor. This rule, therefore, 
could be applied in only a very few cases. 
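
The part-composite correlation of equation 71, recommended under point 4(b) as the guide to effective weights, can be sketched in Python as follows. The function name and the three-part example are hypothetical; the point of the illustration is only that nominally equal weights need not give equal correlations with the composite.

```python
import math


def part_composite_correlations(weights, sds, r):
    """Correlation of each part g with the weighted composite (equation 71).

    r is the intercorrelation matrix of the parts, with unity in the
    diagonal, so the term for h == g supplies W_g * s_g.
    """
    k = len(weights)
    # Variance of the composite: sum over g, h of W_g W_h s_g s_h r_gh.
    var_c = sum(
        weights[g] * weights[h] * sds[g] * sds[h] * r[g][h]
        for g in range(k)
        for h in range(k)
    )
    s_c = math.sqrt(var_c)
    return [
        sum(weights[h] * sds[h] * r[g][h] for h in range(k)) / s_c
        for g in range(k)
    ]


# Hypothetical parts: unit weights, unequal standard deviations,
# and all intercorrelations equal to .30.
weights = [1, 1, 1]
sds = [2.0, 5.0, 10.0]
r = [[1.0, 0.3, 0.3], [0.3, 1.0, 0.3], [0.3, 0.3, 1.0]]

print([round(x, 3) for x in part_composite_correlations(weights, sds, r)])
```

In this example the part with the largest standard deviation correlates most highly with the composite, although all three parts carry the same nominal weight.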


If several scores are to be derived from a set of tests, the best procedure
would be a factor analysis procedure. The battery should be factored,
and a score assigned for each principal factor that is determined. It
should also be noted that, whenever several different scores are assigned
to each person in a group, and differential use is made of these scores, it
is necessary to assess the reliability of the score differences. This
reliability is given by


(100)    $r_{x-y} = \frac{\bar{r} - r_{xy}}{1 - r_{xy}},$


where $r_{x-y}$ is the reliability of the difference score,
$r_{xy}$ is the correlation between the two tests, and
$\bar{r}$ is half the sum of the two reliabilities.


A computing diagram for this equation is given in Figure 3. Equation
100 shows that, unless the average reliability of two tests is considerably
higher than the correlation between them, the differences will be very
unreliable. This means that in making differential predictions, or in
interpreting profiles, judgments will usually be made on the basis of
accidental score differences. Unless $r_{x-y}$ is .80 or larger, valid judgments
of individuals cannot be made on the basis of score differences between
tests x and y. All differential prediction batteries or batteries that are
to be used as profiles should give information on the reliability of the
difference score for each pair of tests in the battery.
Problems


1. What is the expected value of the correlation between any two composites
from a battery of forty tests, with an average intercorrelation of .30? Assume that
only positive weights are used and that the average weight is about equal to the
standard deviation of the weights.


2. On the foregoing assumption regarding weights, what is the expected value
of the correlation between any two composites from a battery of five tests with an
average intercorrelation of .20?

DATA FOR PROBLEMS 3 TO 7
Entering freshmen at the University of Chicago are given an A. C. E.
Psychological Examination (a), a physical sciences aptitude test (s), and an English
placement test (e). A year later they are given the physical science comprehensive
(p) and the humanities comprehensive (h).


The following zero-order correlations are obtained:
$r_{as} = .50$, $r_{ae} = .70$, $r_{se} = .40$,
$r_{ap} = .90$, $r_{sp} = .70$, $r_{ep} = .40$,
$r_{ah} = .60$, $r_{sh} = .20$, $r_{eh} = .70$, $r_{ph} = .60$.


The following means and standard deviations are found:


                        a      s      e      p      h
Mean                  120    110    150    220    460
Standard deviation     30     20     25     30     40


3. Write the equation for making the best prediction of the humanities
comprehensive score from the three placement tests.


4. What will be the correlation between the predicted humanities scores and the
actual scores, using the prediction equation given in 3?


5. Which two placement tests will give the best prediction of scores in the
physical-science comprehensive?


6. Write the equation for making the best prediction of the physical-science
comprehensive score from the two tests mentioned in 5.


7. What is the correlation between the actual physical-science scores and the
scores predicted by using the equation given in 6?
DATA FOR PROBLEMS 8 TO 17


          Number              Standard     Formula for
          of Items    Mean    Deviation    Transformed    Reliability    Intercorrelations
            (K)      (X̄)       (s_X)       Score (Y)       (r_xx)
Test a       10        6.1       1.3        Y = 10X          .62          r_ab = .36
Test b       50       35.6       4.2        Y = X            .83          r_ac = .42
Test c      200      153.7      15.8        Y = X/2          .95          r_bc = .65


X, Y, and z scores for the following problems are defined in the foregoing
table.


8. For the data given in the table, discuss the desirable and undesirable
characteristics of a composite score formed by weighting $X_a$, $X_b$, and $X_c$ according to the
reliability of each test as indicated in equation 66. Would the composite be the same
or different if the Y-scores or the z-scores were weighted as indicated in equations
65 or 66?


9. Give the desirable and undesirable characteristics of a composite formed by 
weighting the X-scores inversely as the standard deviation of the test. Would the 
composite be the same, or different, if the Y-scores or the z-scores were weighted 
according to the same principle? 


10. Give the desirable and undesirable characteristics of a composite formed by 
weighting the X-scores, Y-scores, and z-scores inversely as the error of measurement. 


11. Give the desirable and undesirable characteristics of a composite formed by 
weighting the X-scores, Y-scores, and z-scores inversely as the error variance. 


12. Give the desirable and undesirable characteristics of a composite formed by 
weighting the X-scores, Y-scores, and z-scores directly as K, the number of items in 
the test. 


13. Give the desirable and undesirable characteristics of a composite formed by 
weighting the X-scores, Y-scores, and z-scores inversely as K. 


14. Give the desirable and undesirable characteristics of a composite formed by 
weighting the X-scores, Y-scores, and z-scores inversely as the test mean. 


15. Judging in terms of the correlation of the part with the composite (equations
71 or 72), how would a, b, and c be weighted by (a) adding X-scores; (b) adding 
Y-scores; (c) adding z-scores; (d) taking the composite, 7’ = 10X. + 235.41 X. 
16. What weighting factors should be assigned to the X-scores in order to obtain
a composite t, such that $r_{at}$, $r_{bt}$, and $r_{ct}$ are approximately in proportion to 2, 3, and 4,
respectively (see equation 72)?


17. What is the reliability (a) of the difference score $X_a - X_b$; (b) of the difference
score $X_a - X_c$; (c) of the difference score $X_b - X_c$?


18. Prove that gross score weights that are inversely proportional to the error
variance for gross scores are identical with the weights of equations 65 and 66.


19. (a) Determine the standard score weights that are inversely proportional to the
error variance of a standard score. Compare this weight with that of equation 65.
(b) Determine the gross score weights that will give results identical with these
standard score weights, and compare these gross score weights with those of equation
66.


21


Item Analysis


1. Introduction 

Basically, item analysis is concerned with the problem of selecting 
items for a test so that the resulting test will have certain specified 
characteristics. For example, we may wish to construct a test that is 
easy or one that is difficult. In either case it is desirable to develop a 
test that will correlate as high as possible with certain specified criteria 
and will have a satisfactory reliability. The index of skewness should 
be positive, negative, or zero for a specified population. If a battery 
of several tests is being constructed, it may be desirable to have the 
intercorrelations as low as possible. It is also of considerable interest 
to be able to construct a test so that the error of measurement is a
minimum for a specified ability range or so that the error of measurement 
is constant over a wide ability range, as is assumed in the development 
of formulas for variation in reliability with variation in heterogeneity 
of the population (see Chapters 10, 11, and 12). In each of these situa- 
tions it would be convenient to be able to write the prescription for item 
selection so that we should be able to subject a set of K items to an 
appropriate type of analysis, and then to select the subset of k items
that would come nearest to satisfying the desired characteristics. 

As yet the rationale of item analysis has been developed for only a
few of the problems indicated. Numerous arbitrary indices have been
devised and used. Twenty-three methods are listed and described by
Long and Sandiford (1935). Nineteen methods are summarized by
Guilford (1936b) in Psychometric Methods, pages 426-456. With one or
two exceptions, these lists are essentially the same. For earlier surveys
of item analysis methods, see Cook (1932) or Lentz, Hirshstein, and
Finch (1932). The striking characteristic of nearly all the methods
described is that no theory is presented showing the relationship between
the validity or reliability of the total test and the method of item
analysis suggested. The exceptions, which show a definite relationship
between the item selection procedure and some important parameter
of the test, are Richardson (1936a); the method of successive residuals,
Horst (1934b); the use of a maximizing function, Horst (1936b); and the
L-method and its various modifications (see Toops, 1941, Adkins and
Toops, 1937, and Richardson and Adkins, 1938).

In developing and investigating procedures of item analysis, it would 
seem appropriate, first, to establish the relationship between certain 
item parameters and the parameters of the total test; next, to consider 
the problem of obtaining the item parameters in such a way that they 
will, if possible, not change with changes in the ability level of the 
validating group; and, last, to consider the most efficient methods, from 
both a mathematical and a computational viewpoint, of estimating 
parameters for the items. 

The method of item selection used and the theory on which it is based 
must be directly related to the method of test scoring. For the usual 
aptitude or achievement test, the responses to each item may be classified 
as either correct or incorrect, and the item analysis procedures utilize 
this information. For items relating to personality, interests, attitudes, 
or biographical facts, the responses cannot be classified as either correct 
or incorrect. A set of such items demands a more complex type of
item analysis procedure that not only gives information on item selection 
but also furnishes a scoring key. If an achievement or aptitude test is 
scored in terms of “level reached,” it would seem appropriate to use the 
item analysis methods of absolute scaling (see Thurstone, 1925 and 
1927b) or some other psychophysical method. Such procedures do
not seem appropriate for the usual test that is scored by counting the 
number of correct responses. In this chapter we shall consider only 
those item analysis procedures suitable for the case in which the item 
responses may be classified as correct or incorrect and in which the score 
is the number of correct responses. 

Another consideration that will affect item analysis methods is the 
extent to which the group available for item analysis purposes is similar 
to or different from the prospective test group. For example, a group 
of students in a college with high admission standards might be the only 
group available for experimental purposes for a test that is to be generally 
used for college admission. In this case item information from a group 
of high ability is to be used in constructing a test to be used for a group 
with a lower average ability and a larger variance in ability. Other 
variants of this problem may arise. For example, considerable item 
analysis data may be available on a large population of applicants for 
college admission, and we may wish to use this information in selecting 
items suitable for a scholarship examination that is to be taken only 
by superior students. The item selection problem is clearly much
simpler when the item analysis group and the prospective test group
are similar in mean and variance on the particular ability to be tested. 

It is important to note that, while the item analysis rationale and the 
quantitative item selection procedures are the same for aptitude and 
achievement tests, there is one important difference. In the construction 
of aptitude tests the item statistics may be allowed to control the 
rejection and selection of items more fully than in the construction of 
achievement tests. The judgment of the subject matter expert must 
always play an important part in the selection and rejection of items 
for an achievement test. If the item analysis results show that a given 
item should be used, and the expert finds that the item is incorrect, 
that item must be revised. If the item analysis results show that an 
item should be deleted, and the subject matter expert feels that essential 
knowledge is being tested in the item, then attempts must be made to 
discover the flaw in item construction and revise the item so that it 
will satisfy both the item analysis criteria and the judgment of the 
subject matter specialist. It may even be that the fault lies in the 
teaching methods used so that the item that is unsatisfactory from the 
viewpoint of item analysis statistics will show satisfactory item analysis 
results for a new class that has been taught differently. In an achieve- 
ment test the goal should be to obtain items that are satisfactory from 
the viewpoint of both the item analysis results and the subject matter 
specialist. In order to do this, it may be necessary to revise the item, to 
revise the criterion against which the item is validated, to revise the 
methods of teaching, or the content of the course. 

Relatively little of a precise nature is now known regarding the effect 
of item selection on test skewness, kurtosis, or on the constancy of the 
error of measurement throughout the test score range. It is possible, 
however, to select items in such a way as to influence the test mean, 
variance, reliability, and validity. We shall now consider item selection 
in relation to these four test parameters for tests that are scored by 
counting the number of correct responses and are composed of items the 
responses to which are either correct or incorrect. It will also be assumed 
that the item analysis group and the prospective test group have similar 
means and variances of the ability to be tested. 


2. Item parameters related to the test mean 
Let $A_{ig}$ designate the score of the ith person on the gth item. As
shown in Table 1, $A_{ig}$ is unity if the ith person answered
the gth item correctly, and zero if the answer was incorrect. Let
N be the number of persons taking the test (i = 1 ... N), and
K be the number of items in the test (g = 1 ... K).
Since each person's score is the number of items correctly answered,
we may write


(1)    $X_i = \sum_{g=1}^{K} A_{ig}.$

That is, each person's score is the sum of the entries in a row, as shown
in Table 1.
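TABLE 1

                               Items (g or h = 1 ... K)
                          1       2       3      ...      K       Score
                  1     A_11    A_12    A_13     ...    A_1K       X_1
Individuals       2     A_21    A_22    A_23     ...    A_2K       X_2
(i or j =         3     A_31    A_32    A_33     ...    A_3K       X_3
 1 ... N)        ...     ...     ...     ...     ...     ...       ...
                  N     A_N1    A_N2    A_N3     ...    A_NK       X_N
Sums                   ΣA_i1   ΣA_i2   ΣA_i3     ...   ΣA_iK      ΣX_i
Sums ÷ N                p_1     p_2     p_3      ...    p_K        M_X

Each entry A_ig is 1 if person i answered item g correctly and 0 otherwise.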

The test mean is given by 


(2)    $M_X = \frac{\sum_{i=1}^{N} X_i}{N}.$


Substituting equation 1 in equation 2 and noting that the grand total
given by adding the row sums is the same as that given by adding the
column sums, we may write


(3)    $M_X = \frac{\sum_{i=1}^{N}\sum_{g=1}^{K} A_{ig}}{N} = \frac{\sum_{g=1}^{K}\sum_{i=1}^{N} A_{ig}}{N}.$


For any given item g the item difficulty is defined as the proportion of
correct responses. Designating the difficulty of item g by $p_g$, we have


(4)    $p_g = \frac{\sum_{i=1}^{N} A_{ig}}{N}.$
Substituting equation 4 in equation 3, we write

(5)    $\bar{X}$ or $M_X = \sum_{g=1}^{K} p_g = K\bar{p},$

where $M_X$ is the test mean, also designated $\bar{X}$,
$p_g$ is the proportion of correct responses for the gth item
(g = 1 ... K),
K is the number of items in the test, and
$\bar{p}$ is $(1/K)\sum_{g=1}^{K} p_g$, the average item difficulty.

If test score is taken as the number of correct answers, the test 
mean is equal either to the number of items multiplied by the 
average item difficulty or to the sum of the item difficulties, 
when item difficulty is defined as the proportion of correct re- 
sponses. 
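
The relations in equations 1 through 5 can be traced on a small score matrix. The Python sketch below uses a hypothetical matrix of five persons and four items (the entries are invented for illustration) and shows that the mean of the row sums equals the sum of the item difficulties.

```python
# Hypothetical item-score matrix: rows are persons, columns are items.
# An entry is 1 for a correct response and 0 otherwise.
A = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]

N = len(A)        # number of persons
K = len(A[0])     # number of items

# Equation 1: each person's score is the sum of the entries in a row.
scores = [sum(row) for row in A]

# Equation 2: the test mean is the average of the person scores.
test_mean = sum(scores) / N

# Equation 4: item difficulty is the column sum divided by N.
difficulty = [sum(A[i][g] for i in range(N)) / N for g in range(K)]

# Equation 5: the test mean equals the sum of the item difficulties,
# which is K times the average item difficulty.
mean_difficulty = sum(difficulty) / K

print(scores, test_mean)
print(difficulty, K * mean_difficulty)
```

The agreement holds only because the same matrix of 1's and 0's is summed by rows for scores and by columns for difficulties.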


It should be noted that equation 5 holds only if "correct response"
and "incorrect response" are defined in the same way for both test scoring
and item analysis purposes. For example, if the score is "number right,"
items answered incorrectly, items skipped, and items omitted will each
count zero in determining total score. They must then be similarly
counted when obtaining $p_g$. Table 1 shows that we have assumed a
matrix of "1's" and "0's." These terms are added by rows to determine
the score of each person, and the same terms are added by columns to
determine item difficulty. If the test is a power test, item difficulty
defined as the proportion of correct responses will represent a
characteristic of the item in relation to the ability of the group. If the test is a
speed test, $p_g$ is entirely or primarily a characteristic of the position of
the item in the test and the timing of the total test. For a speed test,
"proportion of correct responses" does not represent a characteristic of
the item; hence this type of analysis is inappropriate insofar as a test is
speeded.


3. Item difficulty parameters that compensate for changes in
group ability

Several measures of item difficulty have been suggested that allow 
for the possibility that the item analysis group may be different from 
the prospective test group. 

Thurstone’s difficulty calibration method (Thurstone, 1947a), which he 
has used in the construction of the American Council on Education 
Psychological Examination, is the simplest and most direct method of 
compensating for possible changes in the ability level of the group
whenever a new set of items is given to a new item analysis group. 
Consider the situation in which test Y given to group B is to be equated to 
test X, which has previously been given to group A. About twenty items 
from test X, which are well scattered over the total item difficulty range, 


are affecting that item, and its behavior is not indic 
differences. 

A normal curve transformation of percentage correct has been
suggested by several authors, Ayres (1915), Thurstone (1925), Thorndike
(1927), Bliss (1929), Symonds (1929), Horst (1933), and others. A
variant of this method used by the College Entrance Examination Board
was devised by Brolyer, and is described by Brigham (1932), page 356.
Brolyer's index, called delta ($\Delta$), was set up to take care of the problem
posed by time limit tests in which only the superior
items at the end of the test. As a result, we do


the item correctly had they attempted it. 

The College Entrance Examination Board procedure is to assign
each person a linear derived score (w) on the total test. This score is
used as the criterion against which each item is evaluated. For the
group attempting an item, we find the mean (m) and the standard
deviation (s) of the total test score in terms of the w-scale. The number
of persons answering the item correctly is divided by the total number
attempting that item to determine the percentage correct (p). This
percentage is then converted into delta ($\Delta$), a base line reading on the
w-scale, by the equation,


(6)    $\Delta = m_w + s_w z_p,$
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where 2, is a base line normal deviate corresponding to p (if p > .50, 
£y <0; if p < .50, z > 0), 
Mw is the mean w-score of the group attempting the item, 
Sw is the standard deviation of w-scores for this group, 
A is the desired standard measure of item diffieulty, and 
pis the number answering the item correctly divided by the 
number attempting the item. 
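
As an illustration of equation 6 (delta equals the group mean on the w-scale plus the group standard deviation times the normal deviate for p), the computation may be sketched in Python. The function name and the numerical values are assumptions made for this example only; they are not taken from College Entrance Examination Board practice.

```python
from statistics import NormalDist


def delta_index(p, m_w, s_w):
    """Equation 6: delta = m_w + s_w * z_p.

    z_p is the normal deviate that cuts off the upper proportion p,
    so z_p < 0 when p > .50 and z_p > 0 when p < .50.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    z_p = NormalDist().inv_cdf(1.0 - p)
    return m_w + s_w * z_p


# Hypothetical group: mean 13 and standard deviation 4 on the w-scale.
easy_item = delta_index(0.84, m_w=13.0, s_w=4.0)    # falls below the mean
middle_item = delta_index(0.50, m_w=13.0, s_w=4.0)  # falls at the mean
hard_item = delta_index(0.16, m_w=13.0, s_w=4.0)    # falls above the mean
```

An item passed by exactly half of the group attempting it receives a delta equal to that group's mean w-score, whatever the item-test correlation may be.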


The most serious objection to this method of indicating item difficulty
is that it takes no account of the correlation between the item and the
total score. Given a certain set of values for m, s, and p, the value of Δ
is the same regardless of whether the correlation between item and total
score is .00, .30, or .60. Other writers have presented a method of
transforming p to a linear scale that is influenced by the item-criterion
correlation.

A regression line transformation for p (percentage correct) has been
suggested by Thorndike (1927), Bliss (1929), and others. The purpose
of this method is to find the ability level at which half the persons will
pass the item, and half will fail. It is analogous to the cutting score
method described in Chapter 19, section 13. The regression of the
normalized item score (designated z) on the criterion score (designated x)
is used. This is written

(7)  $z = r_{xz} \dfrac{Z}{s} x,$

where $r_{xz}$ is the biserial correlation between item and total test score
or some other criterion,
Z is the standard deviation of the normalized item score, which
is taken as unity, and
s is the standard deviation of the criterion score.

It is desired to find the criterion score $x_p$, which corresponds to the
point at which half the persons would fail and half pass the item. This
is the point $x_p$, which corresponds to the line between those passing
and those failing the item. This point on the z-scale will be designated
$z_p$. It is equal to the normal base line equivalent of the percentage
passing the item (p). If p is greater than .50, $z_p$ is negative; if p is less
than .50, $z_p$ is positive. Substituting $x_p$ and $z_p$ in equation 7 and solving
explicitly for $x_p$ gives

(8)  $x_p = \dfrac{s}{r_{xz} Z} z_p,$

where $x_p$ is the criterion score level at which half the persons will fail,
and half pass the item,
$z_p$ is a normal base line equivalent of p, the percentage correct
for the item (if p > .50, $z_p < 0$; if p < .50, $z_p > 0$). The
other terms have the same definition as in equation 7, Z being
taken as unity.
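
The regression line transformation can be sketched in the same way. The sketch below assumes that equation 8 has the form x_p = s z_p / (r Z), with r the biserial item-criterion correlation, s the criterion standard deviation, and Z taken as unity; the function name and the data are hypothetical.

```python
from statistics import NormalDist


def median_ability_level(p, r_xz, s, Z=1.0):
    """Equation 8: x_p = s * z_p / (r_xz * Z), in deviation-score units."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if r_xz == 0:
        raise ValueError("x_p is undefined when the item-criterion correlation is zero")
    z_p = NormalDist().inv_cdf(1.0 - p)
    return s * z_p / (r_xz * Z)


# Two hypothetical items, both passed by 30 per cent of the group,
# with a criterion standard deviation of 10.
weak_item = median_ability_level(0.30, r_xz=0.30, s=10.0)
strong_item = median_ability_level(0.30, r_xz=0.60, s=10.0)
```

Unlike delta, this measure depends on the item-criterion correlation: halving the correlation doubles the distance of the 50 per cent point from the criterion mean.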


The item criterion curve has also been suggested as giving an indication
of the ability level at which half the persons would fail and half pass the
item. In this method the group taking the test is divided into five,
ten, or twenty subgroups on the basis of some criterion, usually the
total test score. These groups are taken as representing various ability
levels. Then the percentage correct on a given item for each of these
subgroups is computed. In general it is found that only a small percentage
of the lowest group gets the item correct and that a larger and
larger percentage of each succeeding ability group gets the item correct.
From this information we can determine by interpolation (or extrapolation,
in the case of very easy or very difficult items) the ability level at
which half the persons would answer the item correctly and half
incorrectly. This level then represents the criterion level at which half
fail and half pass the item, and is taken as indicating the item difficulty.
If the assumptions for a biserial correlation coefficient are met, this
method will give results identical with those obtained by equation 8,
since its purpose and method of procedure are essentially identical.
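
The interpolation step of the item curve method may be sketched as follows. Linear interpolation between adjacent subgroups is an assumption of this sketch (the text does not specify the form of interpolation), and the subgroup data are hypothetical.

```python
def fifty_per_cent_level(levels, proportions):
    """Return the ability level at which the item curve crosses .50.

    levels: mean criterion score of each subgroup, in ascending order.
    proportions: proportion of each subgroup answering the item correctly.
    Linear interpolation between adjacent subgroups is used; None is
    returned when the curve never crosses .50, in which case
    extrapolation would be needed.
    """
    points = list(zip(levels, proportions))
    for (x0, p0), (x1, p1) in zip(points, points[1:]):
        if p0 != p1 and (p0 - 0.5) * (p1 - 0.5) <= 0:
            return x0 + (0.5 - p0) * (x1 - x0) / (p1 - p0)
    return None


# Hypothetical data for five subgroups formed on total test score.
levels = [10.0, 20.0, 30.0, 40.0, 50.0]
proportions = [0.10, 0.25, 0.45, 0.70, 0.90]
midpoint = fifty_per_cent_level(levels, proportions)  # lies between 30 and 40
```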

The four types of methods just discussed, Thurstone’s method of 
calibrating item difficulty, the normal curve transformation as repre- 
sented in equation 6, the transformation based on the regression line 
as shown in equation 8, or the use of the midpoint of the item curve, 
may all be regarded as attempts to find an item difficulty parameter 
that is invariant with respect to changes in the mean or dispersion of 
the ability of the group. As far as the author is aware, there is no pub- 
lished experimental evidence to show how well any of these methods 
succeeds in its purpose. The first and last methods are simple and direct, 
involving no assumptions such as those in equations 6 or 8. However, 
if the assumptions of biserial correlation are justified, it would seem that 
the method represented by equation 8 is best since it makes use of all 
the available data to determine the item difficulty level. 

If the total test score is to be determined by counting the number of 


items answered correctly, it does not seem particularly appropriate to 
measure item difficulty in terms of criterion le 


6, 8, and the item curve method. Such measur i 
seem appropriate for a test that is to be scored 
or for a test that is constructed by the absolu 
stone, 1925 and 1927b). However, if these i 
terms of criterion level turn out to be relati
to changes in group ability level, it should be possible to translate them
into different “percentage correct” scores corresponding to the particular
group to be tested. 


4. Estimates of the percentage who know the answer to an item 
Other measures of item difficulty have been devised to estimate the 
percentage of persons in the group that “know” the answer to the item, 
as distinct from those who guess, and guess correctly. 
Guilford (1936a) has suggested that the usual method of correcting 
for chance be applied to items as well as test scores. This method 
involves two assumptions. 


1. That the persons can be divided into two groups, (a) those who 
know the answer and (b) those who guess the answer. 

2. Those who guess are equally likely to select any one of the alterna- 
tives given. 


Let f designate the number of different answers given for an item, 
then 1/fth of those who "guessed" would guess correctly, and (f — 1)/f 
would guess incorrectly. Since this latter group includes all who 
answer incorrectly (by assumption 1 above there is no misinforma- 
tion leading to the incorrect answer), 1/(f — 1) of those who answer 
incorrectly is equal to the number of lucky guessers; hence, subtracting 
(Number wrong)/(f — 1) from the number right will give the number 
who got the right answer not by guessing but by knowledge. The 
percentage who know the answer (designated p') may be written 


(9)  $p' = \dfrac{R_i - \dfrac{W_i}{f - 1}}{T},$


where $R_i$ is the number of correct answers to the item,
$W_i$ is the number of incorrect answers to the item,
f is the number of possible answers given for each item,
T is the total number who tried the item [T may be considered
equal to rights plus wrongs ($R_i + W_i$) or may also include
those who skipped the item], and
p' is an estimate of the percentage knowing the answer to that
item.
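
The correction for chance applied to a single item, as stated in the paragraph preceding equation 9 (number right minus number wrong divided by f - 1, all divided by the number trying the item), may be sketched as follows. The function name and the counts are hypothetical.

```python
def proportion_knowing(right, wrong, f, total=None):
    """Equation 9: p' = (R - W / (f - 1)) / T.

    right, wrong: numbers answering the item correctly and incorrectly.
    f: number of alternatives offered for the item (must exceed 1).
    total: T; defaults to right + wrong, that is, omits are excluded.
    """
    if f < 2:
        raise ValueError("an item must offer at least two alternatives")
    if total is None:
        total = right + wrong
    return (right - wrong / (f - 1)) / total


# Hypothetical five-choice item: 60 right, 40 wrong, nobody omitting.
p_raw = 60 / 100
p_corrected = proportion_knowing(right=60, wrong=40, f=5)
```

Here 40 wrong answers imply 10 lucky guessers, so that 50 of the 100 persons are estimated to know the answer.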


It should be noted that one implication of this method is that the 
same number of persons will select each of the incorrect alternatives, 
and that some number greater than this will select the correct alternative. 
Investigation of any multiple choice test will show that rarely, if ever, 
are all the distractors equally attractive. Horst (1933) has suggested 
an item difficulty measure for multiple choice items that assumes that 
the different distractors are unequally attractive. 

Horst (1933) makes the two assumptions indicated for equation 9, 
and in addition he assumes that those who do not know the correct 
answer fall into various subgroups. The first subgroup is composed of 
those who know nothing about the alternatives in question; hence the 
members of this group are distributed equally to all of the f possible 
answers. A second group is composed of those who know that one of the 
alternatives is wrong, hence distributes its answers uniformly over the 
remaining f — 1 choices, and so on. The next-to-the-best group knows 
that all but two of the alternatives are wrong, hence distributes its 
choices evenly between the correct answer and one of the incorrect 
choices. The best group is composed of those who know the right 
answer and those who know that each of the other choices is wrong, 
hence pick the right answer by elimination. According to this reason- 
ing, the number of persons in this last group is equal to the number 
choosing the correct alternative minus the number who mark the most 
popular incorrect alternative. 

Let us consider what would happen to a five-alternative item. Let 5a
designate the number of persons knowing nothing. Since they distribute
equally among the five alternatives, a persons will choose each
of the five alternatives. Let 4b designate the number who know that
one of the alternatives is wrong; b of them will choose each of the other
four answers. The next group is designated by 3c, c of whom will choose
the correct answer and c of whom will choose each of the two most popular
wrong answers. Assume that 2d persons know enough to avoid all
but one of the distractors, hence divide equally between it and the
correct answer. Finally we have e persons who know the right answer
or else know that all the others are wrong; hence all these e will pick
the correct answer. Let us use $W_1$ to designate the number picking the
poorest distractor, $W_2$ for the number picking the next most popular,
and so on up to $W_{f-1}$ for the number picking the most popular distractor.
Then we may write

$W_1 = a,$
$W_2 = a + b,$
$W_3 = a + b + c,$
$W_4 = a + b + c + d,$
$R = a + b + c + d + e.$

Thus we have

$e = R - W_4.$

In general we see that the number of persons who know the correct
answer is equal to the number marking the correct answer minus the
largest number selecting any one of the incorrect answers. If we designate
the corresponding estimate of the percentage knowing the correct
answer by p'', we have

(10)  $p'' = \dfrac{R - W_{f-1}}{T},$

where R is the number of persons selecting the correct answer,
$W_{f-1}$ is the number selecting the most popular incorrect answer, and
T is the total number of persons responding to that item.


This method has the distinct advantage over equation 9 that it takes
account of the fact that different numbers of persons will pick the different
distractors in an item. It also furnishes a criterion for the possible
presence of actual misinformation. According to the theory, more
persons will select the correct alternative than will select any of the
incorrect alternatives. This follows as a consequence of the assumption
that any subgroup with a given amount of information will distribute 
equally among the alternatives they do not know to be false. In ampli- 
fying this theory, allowance should be made for chance variations from 
such a distribution. We may say, however, that, if a considerably 
greater number of persons select one distractor than select the correct 
answer, it is likely that some actual misinformation exists in the group, 
and the method indicated in equation 10 does not apply. A method 
of test scoring appropriate for the measures of item difficulty shown in 
equations 9 and 10 has not been suggested. 
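
Horst's estimate (the number marking the correct answer minus the number marking the most popular incorrect answer, divided by the number responding) may be sketched as follows. The subgroup sizes are hypothetical and are chosen only to show that the estimate recovers the size of the group that knows the answer when the assumptions hold exactly.

```python
def horst_proportion_knowing(right, distractor_counts, total=None):
    """Equation 10: p'' = (R - W_max) / T.

    right: number choosing the correct alternative.
    distractor_counts: number choosing each incorrect alternative.
    total: T; defaults to everyone who marked some alternative.
    A negative result means that a distractor drew more choices than
    the correct answer, which suggests misinformation in the group.
    """
    if total is None:
        total = right + sum(distractor_counts)
    return (right - max(distractor_counts)) / total


# Hypothetical subgroup sizes for a five-choice item: 5a, 4b, 3c, 2d, e.
a, b, c, d, e = 2, 3, 4, 5, 30
wrongs = [a, a + b, a + b + c, a + b + c + d]  # W_1 ... W_4
right = a + b + c + d + e                      # R
estimate = horst_proportion_knowing(right, wrongs)
```

In practice chance variation will disturb the exact equalities, so a small negative value is not by itself evidence of misinformation.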


5. Item difficulty parameters—general considerations 

Innumerable other measures of item difficulty have been suggested 
that are based on the percentage correct for the upper and lower k
per cent of the population; see Cook (1932), Lentz, Hirshstein, and
Finch (1932), Guilford (1936b), Kelley (1939), and Davis (1946). The
upper and lower k per cent are chosen on the basis of total test score, 
and k has been given various values such as 10, 20, 25, 27, 33. Such 
difficulty measures are usually incidental to methods for obtaining a 
rapid approximation to the correlation between item and test score. 
Insofar as they are measures of item difficulty, they are regarded as 
approximations to the basic statistic of percentage of persons answering 
correctly. In general, the proper method of evaluating a statistic that 
is an economical approximation to some other statistic is 


1. To determine the standard error or confidence interval for each of
the statistics.

2. To determine how many cases must be used for each method in
order to give statistics of equal precision.¹

3. To determine the dollar cost of each method for the number of
cases indicated in step 2.


Thus the expense of obtaining statistics of equal precision is deter- 
mined, and the cheaper method may then be advocated. As far as the 
author is aware, none of the statistics indicating item difficulty or item- 
total correlation have been subjected to such theoretical and experi- 
mental comparisons. Thus we do not have the only type of information 
that is relevant for judging the relative merits of the different short-cut 
methods. 

In summary, then, we may say that the methods of item analysis 
should be considered as a part of the total test theory problem. The 
theoretical relation between the item parameters and test parameters 
should be shown. In the test theory presented here the number correct
is the score, and, since the mean test score is the sum of the proportion
of correct responses for each item, there is a very simple relationship 
between item difficulty and test mean provided item difficulty is meas- 
ured as the proportion of correct responses. 

The only other difficulty measure that is consistently related to a 
method of test scoring is the median ability level for the item. This 
measure of item difficulty is appropriate for tests set up and scored by 
methods of absolute scaling. 

The other measures of item difficulty have been set up to cope with 
special problems, such as change in ability level of the group, the problem 
of guessing, or the problem of inadequate clerical help, necessitating 
abbreviated methods. Theoretical and experimental information ade- 
quate for evaluating these methods is not yet available. 

There have been several empirical studies that show that tests composed
of items answered correctly by about 50 per cent of the group
have a higher validity than tests composed of items that are easier or
harder than 50 per cent, but otherwise of the same type. See, for example,
Cook (1932), T. G. Thurstone (1932), and Richardson (1936).
In section 8 of this chapter, an equation showing the relationship between
item parameters and test validity is developed (equation 24).
This equation does not show any direct relationship between test validity
and item difficulty. Test validity, however, does depend on the
point-biserial item-criterion correlation. This correlation may increase
rapidly as items approach a 50 per cent difficulty level; see Carroll
(1945) and Gulliksen (1945). Hence it is suggested that the higher
validity found for tests composed of items with 50 per cent difficulty
may be due to and directly measured by the increase in item-criterion
correlation.

¹ The paper by Mosteller (1946) illustrates a good theoretical comparison of several
different methods of estimating a parameter.


6. Item parameters related to test variance 

Another item analysis problem is selecting items in order to control 
the standard deviation of the total test score ($s_x$). We may, for example, 
wish to select a subset of k items out of a total of K items in such a way 
as to have a k-item test with the largest possible standard deviation, 
the smallest, or so that its standard deviation will equal as closely as 
possible that of another test. 

Equation 9 of Chapter 7 gives the variance of a composite as the sum 
of all the terms in the variance-covariance matrix. If the complete 
variance-covariance matrix were available for a set of items, it would 
be possible to add the variances and covariances for different possible 
subsets of items and to find the variance of total test score for each 
possible subset of items. For any large number of items, however, the 
amount of labor required to do this is very great. The procedure 
usually seems impractical with present computational facilities. 

We can obtain a reasonably useful result by working with the correla- 
tion between the item and total test score. From equations 3 to 7, 
Chapter 7, we learn that, if a composite gross score is formed by adding 
gross scores of parts, the deviation score for the composite is the sum 
of deviation scores for the parts; hence from equations 1 and 5 we have 


(11)  $x_i = X_i - M_x = \sum_{g=1}^{K} (A_{ig} - p_g) = \sum_{g=1}^{K} a_{ig},$

where $x_i$ designates the deviation score for the test, and
$a_{ig}$ designates the deviation score for the item.


Designating the standard deviation of item g by $s_g$, that of the total
test by $s_x$, and the item-test correlation by $r_{xg}$, we may write

(12)  $N r_{xg} s_g s_x = \sum_{i=1}^{N} x_i a_{ig}.$


Substituting equation 11 in equation 12 and reversing the order of
summation gives

(13)  $N r_{xg} s_g s_x = \sum_{h=1}^{K} \sum_{i=1}^{N} a_{ih} a_{ig}.$


Note that it is necessary to use two different subscripts (h and g) to
indicate that, for a given item g, we take the cross products with all
items (h = 1 to K, including g). Since the terms of the form $\sum a_{ig} a_{ih}/N$
indicate an interitem covariance, we may divide both sides by N and
write

(14)  $r_{xg} s_g s_x = \sum_{h=1}^{K} r_{gh} s_g s_h \qquad (r_{gg} = 1).$


Since g is a number from 1 to K, and h varies from 1 to K, there will 
be one term in the summation where h = g. This term will be a variance, 
and the other $K - 1$ terms will be covariances. To indicate this explicitly,
we write

(15)  $r_{xg} s_g s_x = s_g^2 + \sum_{h=1}^{K} r_{gh} s_g s_h \qquad (h \neq g),$

where $s_g$ and $s_h$ are item standard deviations, which may be written
$\sqrt{p(1 - p)}$,
$r_{gh}$ is the fourfold point correlation of items g and h,
$r_{xg}$ is the point-biserial correlation of item g with the total
test composite x, and
$s_x$ is the standard deviation of total test score.


In other words, the sum of the terms in any one column (or row) of 
the interitem variance-covariance matrix is the covariance between 
that item and the total test score. By using the gross score formula for 
variance and covariance, these results may be expressed in terms of the 
proportion answering an item correctly and the proportion answering 
both items of a pair correctly. From equation 11 and the definition of 
covariance we have 


$N r_{gh} s_g s_h = \sum_{i=1}^{N} (A_{ig} - p_g)(A_{ih} - p_h) = \sum_{i=1}^{N} A_{ig} A_{ih} - N p_g p_h.$

Since each product $A_{ig} A_{ih}$ is zero if either factor is zero and is unity if
both factors are unity, the summed products are equal to the number
of persons answering both items g and h correctly. This may be verified
with the help of the illustrative table of scores (Table 1). Dividing by
N, we have the proportion of persons answering both g and h correctly,
which will be designated $p_{gh}$. Thus the interitem covariance is

(16)  $r_{gh} s_g s_h = p_{gh} - p_g p_h.$

For the variance of an item, we have the special case of equation 16
in which h = g. In this case $p_{gh}$ becomes $p_{gg}$, which is identical with
$p_g$; thus we have

(17)  $s_g^2 = p_g - p_g^2 = p_g(1 - p_g).$
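
The statement that the interitem covariance equals the proportion passing both items minus the product of the separate proportions, and that the item variance is the special case h = g, can be checked numerically. The small score matrix below is hypothetical and is not Table 1 of the text.

```python
# Rows are persons, columns are items; 1 = correct, 0 = incorrect.
scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 1],
    [0, 0, 0],
]
N = len(scores)
K = len(scores[0])
p = [sum(row[g] for row in scores) / N for g in range(K)]


def cov_from_proportions(g, h):
    """Equation 16: p_gh - p_g * p_h (equation 17 when h == g)."""
    p_gh = sum(row[g] * row[h] for row in scores) / N
    return p_gh - p[g] * p[h]


def cov_from_deviations(g, h):
    """Mean cross product of deviation scores for items g and h."""
    return sum((row[g] - p[g]) * (row[h] - p[h]) for row in scores) / N


largest_gap = max(
    abs(cov_from_proportions(g, h) - cov_from_deviations(g, h))
    for g in range(K) for h in range(K)
)
```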
The item-test covariance shown in equations 14 and 15 may, by the use
of equations 5 and 16, be written

(18)  $r_{xg} s_g s_x = \sum_{h=1}^{K} p_{gh} - M_x p_g \qquad (p_{gg} = p_g),$

where $p_g$ is the proportion of persons answering item g correctly,
$p_{gh}$ is the proportion of persons answering both item g and item h
correctly,
$M_x$ is the mean of the total test score, and the other terms have the
definitions indicated for equation 15.


Substituting equation 15 (Chapter 21) in equation 9 (Chapter 7) and
designating the test variance by $s_x^2$, we have

(19)  $s_x^2 = \sum_{g=1}^{K} r_{xg} s_g s_x.$

The sum of the item-test covariances is equal to the sum of the
terms in the interitem variance-covariance matrix, which is
equal to the test variance. Thus the test variance is expressed
in terms of item parameters.


Since $s_x$ is a constant when summing over g, the right-hand side of
equation 19 may be written $s_x \sum r_{xg} s_g$. Dividing both sides by $s_x$ gives

(20)  $s_x = \sum_{g=1}^{K} r_{xg} s_g = K \overline{(r_{xg} s_g)}.$

Define the product $r_{xg} s_g$ as the “reliability index” for item g.
Then the standard deviation of the total test score (designated
$s_x$) is equal to the sum of the item reliability indices.


It should be noted particularly that no approximations were used in 
deriving equations 19 and 20. The only possible reason for either of 
these equations failing to work in any particular case is the occurrence 
of an arithmetical error in the calculations. It should also be noted 
that, in terms of the derivation, $r_{xg}$ must be a point biserial correlation. 
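
Because the relation is exact, it can be verified on any set of item scores. The sketch below uses a hypothetical score matrix and computes each reliability index as the item-test covariance divided by the standard deviation of the total score, which is algebraically the same as the point biserial correlation times the item standard deviation. Variances are computed with N, not N - 1, in the divisor.

```python
from math import sqrt

# Rows are persons, columns are items; 1 = correct, 0 = incorrect.
scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 1],
    [0, 0, 0],
]
N = len(scores)
K = len(scores[0])
totals = [sum(row) for row in scores]
mean_x = sum(totals) / N
s_x = sqrt(sum((t - mean_x) ** 2 for t in totals) / N)


def reliability_index(g):
    """Return r_xg * s_g, computed as the item-test covariance over s_x."""
    p_g = sum(row[g] for row in scores) / N
    covariance = sum(
        (row[g] - p_g) * (t - mean_x) for row, t in zip(scores, totals)
    ) / N
    return covariance / s_x


indices = [reliability_index(g) for g in range(K)]
# Equation 20: the indices sum to the standard deviation of the total score.
difference = abs(sum(indices) - s_x)
```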

Unfortunately, however, these equations hold exactly only for the 
standard deviation of the total test. For a subtest made up of a subset 
of items, the sum of the item reliability indices based on correlation of 
item with total test score will not exactly equal the standard deviation
of the subtest. For example, if the interitem correlations are nearly
equal and all positive, the sum of the reliability indices for half the items
in the test ($\sum_{g=1}^{K/2} r_{xg} s_g$) will give a value larger than the standard deviation
of the test composed of half the items because for approximately parallel
items the correlation of an item with a longer test will be greater than 
its correlation with a shorter test. This may be seen from equation 5, 
Chapter 9. 

However, a test composed of items with large reliability indices will 
probably have a greater standard deviation than one composed of items 
with small reliability indices. Also if the items in two tests are matched 
simultaneously with respect to two item parameters such as $r_{xg}$ and $p_g$
(since $s_g$ is a function of $p_g$, see equation 17), the two tests will have
closely comparable means and standard deviations. Refer to the
method sketched in Chapter 15, section 7, and Figure 3 of Chapter 15.
We shall see in the next section that the reliability of a test is determined
by the item variances and interitem covariances, together with the
number of items, so that matching two tests item for item with respect
to both $r_{xg}$ and $p_g$ would give tests with similar reliabilities as well as
similar variances.


7. Item parameters determining test reliability 
The equation showing the relation between number of items, item
variance, item reliability index, and test reliability may be written by
substituting equation 20 (Chapter 21) in equation 10 (Chapter 16).
This gives

(21)  $r_{xx} = \dfrac{K}{K - 1} \left[ 1 - \dfrac{\sum_{g=1}^{K} s_g^2}{\left( \sum_{g=1}^{K} r_{xg} s_g \right)^2} \right],$

where K is the number of items in the test,
$s_g^2$ is the item variance, which equals $p_g - p_g^2$,
$r_{xg} s_g$ is the item reliability index, and
$r_{xx}$ is the reliability of the total test.

If we write a sum of terms as K times the average, and divide numerator
and denominator by K, we have

(22)  $r_{xx} = \dfrac{K}{K - 1} \left[ 1 - \dfrac{\overline{s_g^2}}{K \left( \overline{r_{xg} s_g} \right)^2} \right],$

where $\overline{s_g^2}$ is the average item variance, and
$\overline{r_{xg} s_g}$ is the average item reliability index,
and the other terms have the same definitions as in equation 21.

The item variance ($s_g^2$ or $p_g - p_g^2$) approaches zero as $p_g$ approaches
zero or unity, and reaches a maximum value of .25 when $p_g = .5$. Since the
values of $s_g^2$ vary between zero and .25, the average item variance
must also be between these limits. The value of $s_g$ varies between 0 and
.5, and the value of $r_{xg}$ between 0 and 1. Thus the reliability index
must lie between 0 and .5. That is, the average item variance and the
average reliability index vary within narrow limits; hence these factors
cannot have much influence on the test reliability unless $r_{xg}$ is near zero,
in which case the denominator will become small and the reliability will 
be low. On the other hand, K, the number of items, increases uniformly 
with the addition of new items. As can be seen from equation 22, the 
effect of this change in K is to move the reliability nearer to unity. The 
number of items is of itself an important determiner of reliability. As 
long as we avoid items that have a very low or negative correlation with 
total test score, the addition of items with low positive correlations will 
usually increase the reliability of the total test. 


Equations 21 and 22 give the test reliability as functions of
the item reliability index ($r_{xg} s_g$), the item variance ($s_g^2$), and
the number of items (K).
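
A numerical check of the two forms may be sketched as follows. The sketch assumes that equation 21 is K/(K - 1) times one minus the ratio of the sum of the item variances to the squared sum of the reliability indices, and that equation 22 is the same expression written with averages; the score matrix is hypothetical.

```python
from math import sqrt

# Rows are persons, columns are items; 1 = correct, 0 = incorrect.
scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 1],
    [0, 0, 0],
]
N = len(scores)
K = len(scores[0])
totals = [sum(row) for row in scores]
mean_x = sum(totals) / N
s_x = sqrt(sum((t - mean_x) ** 2 for t in totals) / N)

item_variances = []       # s_g squared
reliability_indices = []  # r_xg * s_g
for g in range(K):
    p_g = sum(row[g] for row in scores) / N
    item_variances.append(p_g * (1 - p_g))  # equation 17
    cov = sum((row[g] - p_g) * (t - mean_x) for row, t in zip(scores, totals)) / N
    reliability_indices.append(cov / s_x)

# Equation 21, written with sums over the items.
r_21 = K / (K - 1) * (1 - sum(item_variances) / sum(reliability_indices) ** 2)

# Equation 22, written with averages over the items.
mean_variance = sum(item_variances) / K
mean_index = sum(reliability_indices) / K
r_22 = K / (K - 1) * (1 - mean_variance / (K * mean_index ** 2))
```

For this matrix the sum of the item variances is 25/36 and the test variance is 33/36, so both forms give a reliability of 4/11.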


If the number of items composing the test is fixed, the reliability of 
the test can be increased only by making the average item variance 
smaller or the average item reliability index larger. To make such a 
selection of items graphically, each item is represented by a point, the 
ordinate of which is the item variance ($s_g^2$) and the abscissa of which
is the reliability index ($r_{xg} s_g$). In order to maximize the test reliability,
we must select a subset of points such that the average ordinate is as 
small as possible and the average abscissa is as large as possible. This 
means that the points must be selected from the lower right-hand portion 
of the graph. 

It should be noted that equations 21 and 22 are strictly accurate if 
all the points, that is, all the items in the test, are used. If we consider 
a subset of items that is only a half or a third of the original number of 
items, it is likely that the values of $r_{xg}$ for the total test will be different
from the values of $r_{xg}$ for the subtest. Thus using equation 21 and the
values of $r_{xg}$ for the total test will give an over- or an underestimate of
the reliability of the subtest. However, tests that are matched item for
item on the basis of both item variance ($s_g^2$) and item reliability index
($r_{xg} s_g$) will probably have closely similar reliabilities. A subset of a
given number of items selected for large reliability index and small item
variance will have a higher reliability than a test composed of the same
number of items that have a small reliability index and a large item
variance.

Note that, if we desire to select a subset of k items from a total group
of K items, a completely accurate solution is obtained by using the interitem
variance-covariance matrix, computing the sum of the diagonal
elements ($\sum s_g^2$) and the sum of all the elements for various subsets of
items, and selecting the one subset of size k that has the highest reliability.
However, with current methods of computation, this method is
considered too laborious to be of practical use. The approximation
indicated by the use of equation 21 is, however, computationally
feasible and reasonably accurate if the purpose is to eliminate the
poorest 10 or 20 per cent of the items.
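
For a very small pool of items the exhaustive solution just described can be sketched directly. The score matrix is hypothetical, the reliability of each subset is taken as k/(k - 1) times one minus the ratio of the diagonal sum to the total sum of its covariance submatrix, and the number of subsets to be examined grows so rapidly with K that the sketch is of illustrative value only.

```python
from itertools import combinations

# Rows are persons, columns are items; 1 = correct, 0 = incorrect.
scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 1],
    [0, 0, 0],
]
N = len(scores)
K = len(scores[0])
p = [sum(row[g] for row in scores) / N for g in range(K)]
# Interitem variance-covariance matrix from equation 16.
cov = [
    [sum(row[g] * row[h] for row in scores) / N - p[g] * p[h] for h in range(K)]
    for g in range(K)
]


def subset_reliability(items):
    """Reliability of the test made up of `items` (at least two items)."""
    k = len(items)
    diagonal = sum(cov[g][g] for g in items)
    total = sum(cov[g][h] for g in items for h in items)
    if total <= 0:
        return float("-inf")  # no score variance: reliability undefined
    return k / (k - 1) * (1 - diagonal / total)


def best_subset(k):
    """Examine every subset of size k and return the most reliable one."""
    return max(combinations(range(K), k), key=subset_reliability)


best_pair = best_subset(2)
```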

Numerous arbitrary indices of the relationship between item and test
score have been developed. Adkins (1938) has pointed out that these
indices may be classified as approximations to some one of three
statistics:


1. The item-test correlation.
2. The slope of the regression of test on item.
3. The slope of the regression of item on test.


The first type would be illustrated by the use of various correlation 
coefficients, such as the biserial, the point biserial, or the tetrachoric; 
the second by the use of indices that depend on the mean difference in 
test score between those passing and failing the item; and the third 
by indices dependent on the slope of the item curve (see Ferguson, 1942, 
Finney, 1944, or Turnbull, 1946). Some of the suggested indices are
attempts to decrease the clerical and machine costs of item analysis
by using only a part of the data; see, for example, Kelley (1939), Flanagan
(1939a), and Davis (1946).


8. Item parameters determining test validity 


Having considered item selection in relation to test mean, variance,
and reliability, we turn now to the problem of selecting items to maximize
the validity of the total test score. It is not possible to do this directly
unless we have information regarding the correlation of each item with
the appropriate criterion score. In most practical cases it is probable
that selecting items to increase test reliability will also incidentally
increase test validity. [...] 9, shows that
increasing test length increases the validity of the test. Increasing test
length is also an effective means of increasing reliability, as shown
in equation 22 (of this chapter) [...] 8. However,
special cases have been demonstrated in which it is possible to decrease
validity while increasing reliability or to increase validity at the expense
of reliability; see, for example, Cook (1932), Tucker (1946), or Brogden
(1946b). In other words, if no criterion is available it is highly desirable
to take steps to increase test reliability; however, in laying a theoretical
foundation for improving test validity it is essential to consider the
correlation of each item with a criterion.

Theoretically the problem of maximizing test validity for predicting 
any specified criterion has been solved. We have only to obtain the 
complete interitem variance-covariance matrix and the item-criterion 
covariances, and then solve all multiple correlations or all multiple 
correlations using a specified number of items (equation 55, Chapter 20). 
Frisch (1934) has described the method for dealing with “complete 
regression systems.” Such methods, however, are generally regarded 
as too laborious for present computational procedures. Several approxi- 
mation techniques have been devised as indicated in Chapter 20, sec- 
tion 3. All these methods have in common the assumption that the 
best single test (or item) is included in the best two; that the best two 
will be included in the best three, and so on. By such methods we work 
only K — 1 multiple correlations for K items, which is laborious but 
feasible. Such procedures have been described by Horst (1934), Edger- 
ton and Kolbe (1936), Adkins and Toops (1937), Wherry (1940), Toops 
(1941), and Jenkins (1946). However, it would seem that most test 
workers still consider the labor of these methods prohibitive, since they 
have not attained very wide use. It is possible by using additional 
assumptions to develop a less laborious method that makes use of only 
2K item parameters, namely, a reliability index and a validity index 
for each of the K items of the original experimental test. 

The general formula for the correlation of a criterion with a composite 
is given in equation 1, Chapter 9. Here we will use the subscript y to 
designate the criterion instead of 7 as in equation 1, Chapter 9. The 
formula for the variance of a sum is given in equation 9, Chapter 7. 
Here we shall use the subscript x to designate the total test, instead of c, 
as in equation 9, Chapter 7. If we change subscript c in equation 9, 
Chapter 7, to x, change subscript 7 of equation 1, Chapter 9, to y, and 
substitute equation 9, Chapter 7, in equation 1, Chapter 9, we have 


(23)  $r_{xy} = \dfrac{\sum_{g=1}^{K} r_{yg} s_g s_y}{s_x s_y}.$

Since $s_y$ is the same for all the terms in the summation, it may be
factored out. If we divide numerator and denominator by $s_y$, and
substitute equation 20 in equation 23, we have

(24)  $r_{xy} = \dfrac{\sum_{g=1}^{K} r_{yg} s_g}{\sum_{g=1}^{K} r_{xg} s_g}.$

If we substitute K times the mean for the sums and divide numerator 
and denominator by K, we have 


(25)  $r_{xy} = \dfrac{\overline{r_{yg} s_g}}{\overline{r_{xg} s_g}},$


where the bar over a term indicates the average,
$r_{xg}$ is the point biserial correlation of item g with the test x,
$r_{yg}$ is the point biserial correlation of item g with the criterion y,
$s_g$ is $\sqrt{p_g(1 - p_g)}$, the standard deviation of item g,
$r_{xy}$ is the correlation between the criterion and test, and
K is the number of items in the test.


If $r_{yg} s_g$ is defined as the “validity index” of item g and
$r_{xg} s_g$ as the “reliability index” of item g, the test validity
is the ratio of the sum of the validity indices to the sum of the
reliability indices, or the ratio of the average validity index to
the average reliability index.
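
The identity in equations 24 and 25 can be checked numerically. What
follows is only a minimal sketch in Python, assuming a small
hypothetical set of responses and criterion scores; the data and the
helper names `mean` and `pearson` are illustrative and are not part of
the original treatment. Variances use N, not N - 1, in the denominator,
as elsewhere in this book.

```python
import math

def mean(values):
    return sum(values) / len(values)

def pearson(u, v):
    """Product-moment correlation with N (not N - 1) in the variances."""
    mu, mv = mean(u), mean(v)
    cov = mean([(a - mu) * (b - mv) for a, b in zip(u, v)])
    su = math.sqrt(mean([(a - mu) ** 2 for a in u]))
    sv = math.sqrt(mean([(b - mv) ** 2 for b in v]))
    return cov / (su * sv)

# Hypothetical data: rows are persons, columns are K = 4 items (1 or 0).
responses = [
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
]
criterion = [14, 9, 7, 12, 16, 8, 6, 15]  # hypothetical criterion scores y

test_scores = [sum(row) for row in responses]  # x = number of items correct

validity_indices = []      # r_yg * s_g for each item
reliability_indices = []   # r_xg * s_g for each item
for g in range(len(responses[0])):
    item = [row[g] for row in responses]
    p_g = mean(item)
    s_g = math.sqrt(p_g * (1 - p_g))
    validity_indices.append(pearson(item, criterion) * s_g)
    reliability_indices.append(pearson(item, test_scores) * s_g)

# Equation 24: ratio of the sums of the two kinds of index.
ratio_of_sums = sum(validity_indices) / sum(reliability_indices)
# Direct correlation of total test score with the criterion.
direct = pearson(test_scores, criterion)
```

For the full set of items the two quantities `ratio_of_sums` and
`direct` agree to within rounding error, and the sum of the reliability
indices reproduces the standard deviation of the test scores
(equation 20).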


As a practical item selection procedure it is desirable to plot the item
analysis results. For example, the reliability index may be plotted as
the abscissa and the validity index as the ordinate (Figure 1); then the
items should be selected, as far as possible, from the upper left-hand
corner of the plot, where the validity index is high relative to the
reliability index. This method was described and illustrated by
Gulliksen (1944, 1949a); see Figure 2.

This method of selecting items to give a valid test is similar to the 
one suggested by Horst (1936b). It is of particular interest to note that 
the number of items in the test has, of itself, no effect on validity. 
However, an increase in number of items will, except under unusual 
circumstances, increase the reliability of the test. If no validity index 
is available, increasing the number of items in a test may well contribute 
to lowering the test validity. 

As mentioned in the introduction to this chapter, it should be noted
that the methods presented here consider neither sampling errors nor
the possibility of systematic variation in the item parameters. A subtest
composed of only a few of the most valid items is probably less likely to
maintain its high validity on a new sample of persons than a test
composed of a large number of items. The chance and systematic
fluctuations of the various item analysis parameters need to be studied
and compared for various item analysis methods.

In using equations 24 or 25 for item selection, we should note that the 
validity index for an item is independent of the effects of item selection. 


Figure 1. Illustrating plot of validity index (ordinate) and reliability
index (abscissa) for item selection. (From Gulliksen, 1949a.)


On the other hand, the reliability index will change as the items
composing the test are changed. This effect need cause no concern if only
a few of the poorest items are eliminated from the test. However, if we
wish, for example, to select a test of 100 items from an initial test of
500 items, it is well to make the selection in two or more stages, as
suggested by Horst (1936b). If all the item-test correlations are positive
and high, the selection is not so likely to change the reliability index as
if there were quite a few items with negative reliability indices that were
to be eliminated. In such a case the reliability indices should be
recalculated after the first elimination of items with low and negative
reliability indices.
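
The recalculation just described may be sketched as follows. This is an
illustration only, assuming hypothetical data for eight persons and
five items, of which the last is negatively related to the total score;
the function name `item_indices` and the rule "drop every item whose
reliability index is not positive" are assumptions made for the
example, not a prescribed procedure.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(u, v):
    """Covariance with N (not N - 1) in the denominator."""
    mu, mv = mean(u), mean(v)
    return mean([(a - mu) * (b - mv) for a, b in zip(u, v)])

def item_indices(responses, criterion, keep):
    """Return {item: (reliability index, validity index)} for kept items.

    The total score x is recomputed from the kept items only, so the
    reliability index r_xg * s_g = cov(item, x) / s_x depends on `keep`,
    while the validity index r_yg * s_g = cov(item, y) / s_y does not.
    """
    totals = [sum(row[g] for g in keep) for row in responses]
    s_x = math.sqrt(covariance(totals, totals))
    s_y = math.sqrt(covariance(criterion, criterion))
    indices = {}
    for g in keep:
        item = [row[g] for row in responses]
        indices[g] = (covariance(item, totals) / s_x,
                      covariance(item, criterion) / s_y)
    return indices

# Hypothetical responses (1 = correct, 0 = incorrect) and criterion scores.
responses = [
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [1, 1, 1, 1, 0],
    [0, 1, 0, 0, 1],
    [1, 0, 1, 0, 1],
]
criterion = [20, 17, 13, 11, 8, 19, 10, 15]

# Stage 1: indices on the full test; drop items whose reliability index
# is zero or negative.
first = item_indices(responses, criterion, keep=[0, 1, 2, 3, 4])
kept = [g for g, (reliability, _) in first.items() if reliability > 0]

# Stage 2: recompute on the shortened test before any further selection.
second = item_indices(responses, criterion, keep=kept)
```

In this example only the last item is removed at the first stage. The
validity indices of the remaining items are unchanged at the second
stage, whereas every reliability index takes a new value because the
total score has changed.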
As mentioned in conjunction with item selection to control test
variance and test reliability, it should be noted that, if we consider the
entire test, the ratio of the average validity index to the average
reliability index must equal the test validity. No approximation is
involved. However, as we make more and more stringent selection of test
items, the correlation of the item with the new subtest is increasingly
likely to be different from the correlation of the item with the original
total test. The item selection introduces no error whatever into the
numerator term of equation 24. The error made in estimating the validity
coefficient is due solely to the fact that the correlation of item with
total test will vary as the test length changes. Hence, as mentioned
before, if a computationally feasible method of utilizing the interitem
variance-covariance matrix were developed, it would be possible to select
any subset k from a larger set of K items and to determine precisely the
variance, reliability, and validity of that subset of items.
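
With modern computing aids the calculation from the interitem
variance-covariance matrix is straightforward. The sketch below is an
illustration under stated assumptions: the data are hypothetical, and
`subset_statistics` is a name invented for the example. It returns the
variance and the validity of any subset of items directly from the
item covariances, without rescoring the test.

```python
import math

def mean(values):
    return sum(values) / len(values)

def cov(u, v):
    """Covariance with N (not N - 1) in the denominator."""
    mu, mv = mean(u), mean(v)
    return mean([(a - mu) * (b - mv) for a, b in zip(u, v)])

def subset_statistics(cov_items, cov_item_crit, var_crit, subset):
    """Variance and validity of the score on the items in `subset`.

    cov_items[g][h]  : covariance of items g and h (variance if g == h)
    cov_item_crit[g] : covariance of item g with the criterion
    var_crit         : variance of the criterion
    """
    variance = sum(cov_items[g][h] for g in subset for h in subset)
    cov_xy = sum(cov_item_crit[g] for g in subset)
    validity = cov_xy / math.sqrt(variance * var_crit)
    return variance, validity

# Hypothetical data: 8 persons, K = 4 items, and a criterion score.
responses = [
    [1, 1, 1, 0], [1, 0, 1, 0], [0, 0, 1, 0], [1, 1, 0, 1],
    [1, 1, 1, 1], [0, 1, 0, 0], [1, 0, 0, 0], [1, 1, 1, 1],
]
criterion = [14, 9, 7, 12, 16, 8, 6, 15]

K = len(responses[0])
columns = [[row[g] for row in responses] for g in range(K)]
cov_items = [[cov(columns[g], columns[h]) for h in range(K)] for g in range(K)]
cov_item_crit = [cov(columns[g], criterion) for g in range(K)]
var_crit = cov(criterion, criterion)

subset = [0, 2, 3]
variance, validity = subset_statistics(cov_items, cov_item_crit, var_crit, subset)

# Direct check: rescore the subtest and compute the same two quantities.
sub_scores = [sum(row[g] for g in subset) for row in responses]
direct_variance = cov(sub_scores, sub_scores)
direct_validity = cov(sub_scores, criterion) / math.sqrt(direct_variance * var_crit)
```

The matrix route and the direct rescoring give the same variance and
validity, which is the sense in which the subset statistics are
determined precisely rather than approximately.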


9. Computing formulas for item parameters 

From the theory given in the preceding sections, the essential item
statistics are:

1. $p_g$, the proportion of persons answering each item correctly. This
quantity is a measure of item difficulty. From it, the item variance
$s_g^2 = p_g(1 - p_g)$ can readily be computed.

2. $r_{xg} s_g$, the reliability index, which is the point-biserial
correlation between item and total score multiplied by the item
standard deviation.

3. $r_{yg} s_g$, the validity index, which is the point-biserial
correlation between item and criterion score multiplied by the item
standard deviation.


Having determined the item parameters that are related to test 
mean, variance, reliability, and validity, we turn to the problem of 
computing these values. 

We shall not consider here short-cut methods of estimating these 
parameters from a portion of the data. The principal purpose of these 
methods is to avoid the clerical labor involved in dealing with all the
data; hence they can be compared only on the basis of computing costs 
and statistical precision. As yet such comparisons have not been made. 
For a description of such methods, see Kelley (1939), Flanagan (1939a), 
and Davis (1946). 

The item difficulty measure requires simply a count of the number 
of correct answers to each item. This count may be made manually or, 
if punched-card equipment is available, the count may be made with 
the counting sorter or the tabulator. Usually the count is obtained 
incidentally in connection with the computation of the point-biserial 
correlation or the reliability index. 

When some of the persons taking a test fail to answer certain items, 
we have the problem of how to treat such responses. As indicated in 
Chapter 17, if we are dealing with a speed test all the items must be 
easy so that the only purpose of an item analysis is to eliminate items 
with a significant proportion of errors. In a power test, the number of 
items left blank, either skipped or unattempted, should be negligible. 
An adequate theoretical analysis of a test that is a mixture of speed and 
power has not yet been presented. Such an analysis probably requires
some information or assumptions about the correlation between speed
and power.

The analysis given here applies strictly only to a power test that has
an ample time limit so that practically all the items are attempted.
The discussion of Chapter 17 indicates objective criteria for the possible 
influence of number of blank items. For analyzing a power test with a 
large number of unattempted items, some of the methods given in sec- 
tion 3 of this chapter should probably be used. 

The derivation showing the relationship of item parameters to test
reliability or validity used the Pearson product-moment correlation of
the item with the test or criterion score. The raw score for an item is 
unity if the item is answered correctly, and zero if the item is answered 
incorrectly. Let us begin with the formula for correlation in terms of 
summations and make the simplifications appropriate for this particular 
case. Since the formula for the item-criterion correlation is identical 
with that for the item-test correlation, we shall consider in detail only the 
correlation between a dichotomously scored item and the test score X. 


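(26)    $r_{xg} = \dfrac{N \sum_{i=1}^{N} A_{ig} X_i - \sum_{i=1}^{N} A_{ig} \sum_{i=1}^{N} X_i}{\sqrt{N \sum_{i=1}^{N} A_{ig}^2 - \left(\sum_{i=1}^{N} A_{ig}\right)^2}\;\sqrt{N \sum_{i=1}^{N} X_i^2 - \left(\sum_{i=1}^{N} X_i\right)^2}}$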


$\sum_{i=1}^{N} A_{ig} X_i$ may be simplified by noting that $A_{ig}$ is
either unity or zero; hence the sum of products is equal to the sum of
the test scores for those who answer the item correctly. This sum may be
designated as $\sum_{i=1}^{N_g} X_g$.

Let us define $N_g$ as the number of persons answering item g correctly
and $\bar{X}_g$ as the average test score for those who answer item g
correctly. From these definitions and equation 4 we may write

(27)    $N_g = \sum_{i=1}^{N} A_{ig} = N p_g$

and

(28)    $\sum_{i=1}^{N} A_{ig} X_i = \sum_{i=1}^{N_g} X_g = N_g \bar{X}_g = N p_g \bar{X}_g.$


From the definition of a standard deviation,

(29)    $N s_g = \sqrt{N \sum_{i=1}^{N} A_{ig}^2 - \left(\sum_{i=1}^{N} A_{ig}\right)^2}.$

Substituting equations 27, 28, and 29 in equation 26 and multiplying
both sides by $s_g$ gives the item reliability index in terms of gross
score summations as

(30)    $r_{xg} s_g = \dfrac{N \sum_{i=1}^{N_g} X_g - N_g \sum_{i=1}^{N} X_i}{N \sqrt{N \sum_{i=1}^{N} X_i^2 - \left(\sum_{i=1}^{N} X_i\right)^2}}.$

The reliability index may also be written in terms of means, a
proportion, and a standard deviation. From the definitions of a mean
($\bar{X}$) and a standard deviation ($s_x$) we have

(31)    $\sum_{i=1}^{N} X_i = N \bar{X}$, and
        $N s_x = \sqrt{N \sum_{i=1}^{N} X_i^2 - \left(\sum_{i=1}^{N} X_i\right)^2}.$

Substituting equations 28 and 31 in equation 30, dividing numerator
and denominator by $N^2$, and factoring out $p_g$, we have

(32)    $r_{xg} s_g = p_g \left(\dfrac{\bar{X}_g - \bar{X}}{s_x}\right).$

If Y is used to designate gross scores on the criterion measure, by
substituting Y for X in equations 30 and 32, we have the corresponding
formulas for the item validity index,

(33)    $r_{yg} s_g = \dfrac{N \sum_{i=1}^{N_g} Y_g - N_g \sum_{i=1}^{N} Y_i}{N \sqrt{N \sum_{i=1}^{N} Y_i^2 - \left(\sum_{i=1}^{N} Y_i\right)^2}}$

and

(34)    $r_{yg} s_g = p_g \left(\dfrac{\bar{Y}_g - \bar{Y}}{s_y}\right).$


In equations 30, 32, 33, and 34:
N is the total number of persons taking the test,
$N_g$ is the number of persons answering item g correctly
(g = 1 ... K),
$p_g$ is $N_g/N$, the proportion of persons answering item g
correctly,
$X_i$ and $Y_i$ designate, respectively, the test and criterion score for
individual i (i = 1 ... N),
$\bar{X}$ and $s_x$ are, respectively, the mean and the standard deviation
of the total distribution of test scores,
$\bar{Y}$ and $s_y$ are, respectively, the mean and the standard deviation
of the distribution of criterion scores,
$X_g$ and $Y_g$ designate, respectively, the test and the criterion score
for each person who answers item g correctly,
$\bar{X}_g$ and $\bar{Y}_g$ are the average test and criterion scores,
respectively, for those answering item g correctly,
$r_{xg} s_g$ is the reliability index for item g, and
$r_{yg} s_g$ is the validity index for item g.


The reliability and validity indices for each item can be computed if
we have, for both the test and the criterion, the mean and the standard
deviation of all N persons; $p_g$, the proportion of persons answering
each item correctly; and the average test and average criterion scores
for those answering each item correctly.
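
As an illustration of these computing formulas, the following Python
sketch evaluates an item index both from means (equations 32 and 34)
and from gross score summations (equations 30 and 33). The scores are
hypothetical, and the function names `index_from_means` and
`index_from_sums` are introduced only for this example. Passing test
scores yields the reliability index; passing criterion scores yields
the validity index.

```python
import math

def index_from_means(scores, item):
    """Equations 32 and 34: p_g * (mean of passers - overall mean) / s."""
    n = len(scores)
    mean_all = sum(scores) / n
    sd_all = math.sqrt(sum(v * v for v in scores) / n - mean_all ** 2)
    passers = [v for v, a in zip(scores, item) if a == 1]
    p_g = len(passers) / n
    mean_pass = sum(passers) / len(passers)  # needs at least one passer
    return p_g * (mean_pass - mean_all) / sd_all

def index_from_sums(scores, item):
    """Equations 30 and 33: the same index from gross score summations."""
    n = len(scores)
    n_g = sum(item)                       # number answering correctly
    sum_pass = sum(v for v, a in zip(scores, item) if a == 1)
    sum_all = sum(scores)
    sum_sq = sum(v * v for v in scores)
    return (n * sum_pass - n_g * sum_all) / (n * math.sqrt(n * sum_sq - sum_all ** 2))

# Hypothetical scores of 8 persons and their answers to one item.
test_scores = [31, 27, 18, 25, 20, 29, 22, 16]
criterion_scores = [58, 52, 40, 47, 45, 55, 41, 38]
item = [1, 1, 0, 1, 0, 1, 1, 0]

reliability_index = index_from_means(test_scores, item)
validity_index = index_from_means(criterion_scores, item)
```

The two forms are algebraically identical, so the choice between them
is a matter of which sums or means are already at hand.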

The formulas given here for the point-biserial correlation are a slight 
variant of those presented by Richardson and Stalnaker (1933) and 
Stalnaker (1940). 


Equations 30, 32, 33, and 34 are the basic computing formulas
to be used in calculating the reliability index and the
validity index for a group of items. They are analogous,
except for the factor $s_y$ or $s_x$, to the formulas presented
by Horst (1936b).


10. Summary of item selection theory 


The basic theoretical problem for item analysis procedures is to find 
a functional relationship between the parameters of the total test and 
appropriately selected item parameters. Such a theory must take due 
account of important changes in methods of test scoring. It is then 
necessary to investigate various factors that produce variation in these 
item parameters, such as random sampling error and systematic varia- 
tion produced by changes in such factors as the length of the test and 
the heterogeneity of the group. Various computational short-cut 
procedures utilizing only a portion of the data can also be studied to 
determine which method is most economical. In making such comparisons
it is necessary to adjust the sample size so that the statistics
compared will have the same sampling fluctuation.

In the foregoing sections an item analysis rationale has been presented
for the case in which the test score is the number of items answered
correctly. It has been shown that the test mean, standard deviation,
reliability, and validity may be estimated from three item parameters:
a difficulty, a reliability, and a validity index. The equations are as
follows:


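(5)     $M_x$ or $\bar{X} = \sum_{g=1}^{K} p_g = K \bar{p}$,

(20)    $s_x = \sum_{g=1}^{K} r_{xg} s_g = K (\overline{r_{xg} s_g})$,

(22)    $r_{xx} = \dfrac{K}{K-1} \left[ 1 - \dfrac{\sum_{g=1}^{K} s_g^2}{\left(\sum_{g=1}^{K} r_{xg} s_g\right)^2} \right] = \dfrac{K}{K-1} \left[ 1 - \dfrac{\overline{s_g^2}}{K (\overline{r_{xg} s_g})^2} \right]$,

and

(24 and 25)    $r_{xy} = \dfrac{\sum_{g=1}^{K} r_{yg} s_g}{\sum_{g=1}^{K} r_{xg} s_g} = \dfrac{\overline{r_{yg} s_g}}{\overline{r_{xg} s_g}}$.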


In these equations:


K is the number of items in the test,
N is the total number of persons taking the test,
$N_g$ is the number of persons answering item g correctly
(g = 1 ... K),
$p_g$ is the proportion of persons answering item g correctly
(that is, $p_g = N_g/N$),
$s_g^2$ is the variance of item g [$s_g^2 = p_g(1 - p_g)$],
$r_{xg} s_g$, the item reliability index, is the point-biserial item-test
correlation multiplied by the item standard deviation,
$r_{yg} s_g$, the item validity index, is the point-biserial item-criterion
correlation multiplied by the item standard deviation,
$M_x$ or $\bar{X}$ is the mean for all scores in the test distribution,
$s_x$ is the standard deviation of the distribution of test scores,
$r_{xx}$ is the reliability of the test, and
$r_{xy}$ is the test validity, the correlation of test (x) with the
criterion (y).



Computing formulas for the item reliability and validity indices were 
given as 


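(30 and 32)    $r_{xg} s_g = \dfrac{N \sum_{i=1}^{N_g} X_g - N_g \sum_{i=1}^{N} X_i}{N^2 s_x} = p_g \left(\dfrac{\bar{X}_g - \bar{X}}{s_x}\right)$,


(33 and 34)    $r_{yg} s_g = \dfrac{N \sum_{i=1}^{N_g} Y_g - N_g \sum_{i=1}^{N} Y_i}{N^2 s_y} = p_g \left(\dfrac{\bar{Y}_g - \bar{Y}}{s_y}\right)$.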

In these formulas:


X or x designates the test,
Y or y designates the criterion,
$\bar{Y}$ and $s_y$ are, respectively, the mean and the standard deviation
of the criterion scores,
$X_g$ designates the test score only for those persons who have
answered item g correctly,
$Y_g$ designates the criterion score only for those who have
answered item g correctly, and
$\bar{X}_g$ and $\bar{Y}_g$ are the average test and criterion scores,
respectively, for those answering item g correctly.


The other terms have the same definition as in equations 24 and 25. 
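
The use of equations 5, 20, and 22 may be illustrated by a short Python
sketch. It assumes a small hypothetical response matrix; the function
name `score_parameters` is invented for the example. The item
difficulties and reliability indices are computed first, and the test
mean, standard deviation, and reliability are then obtained from them
alone.

```python
import math

def score_parameters(p, reliability_index):
    """Test mean, standard deviation, and reliability from item parameters.

    p[g] is the proportion passing item g; reliability_index[g] is
    r_xg * s_g for the same item.
    """
    k = len(p)
    mean_x = sum(p)                                    # equation 5
    s_x = sum(reliability_index)                       # equation 20
    sum_item_var = sum(pg * (1 - pg) for pg in p)      # sum of s_g squared
    r_xx = (k / (k - 1)) * (1 - sum_item_var / s_x ** 2)  # equation 22
    return mean_x, s_x, r_xx

# Hypothetical data: 8 persons, K = 4 items scored 1 or 0.
responses = [
    [1, 1, 1, 0], [1, 0, 1, 0], [0, 0, 1, 0], [1, 1, 0, 1],
    [1, 1, 1, 1], [0, 1, 0, 0], [1, 0, 0, 0], [1, 1, 1, 1],
]
n = len(responses)
k = len(responses[0])
totals = [sum(row) for row in responses]
mean_total = sum(totals) / n
sd_total = math.sqrt(sum((t - mean_total) ** 2 for t in totals) / n)

p = [sum(row[g] for row in responses) / n for g in range(k)]
# r_xg * s_g equals the item-test covariance divided by s_x.
reliability_index = [
    (sum(row[g] * t for row, t in zip(responses, totals)) / n
     - p[g] * mean_total) / sd_total
    for g in range(k)
]

mean_x, s_x, r_xx = score_parameters(p, reliability_index)
```

For these data the mean and standard deviation recovered from the item
parameters equal those computed directly from the total scores, and
the reliability from equation 22 is 8/15.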

One systematic error in the foregoing formulas arises from the fact 
that the item reliability index is not invariant with respect to test length. 
In most cases the item-test correlation will increase as the test length 
increases. The item difficulty and validity indices are not affected by 
test length. All three indices are affected by a change in the ability 
level of the group. This means that the item parameters must be ob- 
tained on a group similar to that for which the test is being constructed. 
The item parameters will be more generally useful if it is possible to 
discover parameters that do not vary systematically with changes in 
the mean or variance of group ability. If such parameters cannot be 
found, it may be possible to make some empirical estimations of the
amount of change that may be expected in the item parameters as a
result of a given change in the group.

In addition to systematic changes of item parameters with group ability
and with test length, the item statistics are subject to random sampling 
variation. The magnitude of such fluctuations should be determined 
in order that we may estimate the change in test parameters to be 
expected when the test is used on a new sample. These sampling errors 
could also be used to determine the size for the item analysis sample 
that is necessary to give reasonable sampling stability in the test param- 
eters. 

Numerous arbitrary indices of item difficulty and reliability have 
been given in the item analysis literature. The attempts to express item 
difficulty in terms of an ability level at which the item will be answered 
correctly by half the persons are interesting in that one of them may 
give a difficulty index that does not vary systematically with changes 
in the ability level of the group. Horst’s difficulty index, which estimates 
the number of persons knowing the answer to the item, may also offer 
some interesting possibilities for test construction and scoring. 

A large number of arbitrary indices of item reliability or homogeneity 
have been reported in the literature. Adkins (1938) has shown that 
these indices may be classified as estimates of (1) the item-test correla- 
tion, (2) the regression of item on test, or (3) the regression of test on 
item. The regression of item on test should be invariant with respect 
to selection on the basis of test score. 

Many of the item reliability indices make use of only a portion of the 
data and estimate a correlation or a slope from widespread classes. As 
far as the author is aware, the efficiency of these methods has not been 
compared with methods using the entire sample, when sample size is 
adjusted so as to secure equal sampling errors. 


11. Prospective developments in item selection techniques

In considering the subsequent development of item analysis proce- 
dures, there are a number of problems to which special attention should 
be called. For the special case of tests for which the score is the number 
of correct answers we have several unsolved problems. What are the 
appropriate item selection procedures for controlling the skewness or 
the kurtosis of the distribution of total test scores? The development 
of such procedures will probably present more difficulties than the prob- 
lem of maximizing reliability or validity, since we should usually be 
interested in arriving at some intermediate point, such as zero skew or 
normal kurtosis. This would require much more accurate estimation 
than obtaining the highest reliability or validity possible with a given 
set of items.

A basic assumption in developing the theory of the influence of group 
heterogeneity (see Chapters 10 and 11) is that the error of measurement
does not vary systematically with test score. It is likely that this is 
true for some types of item selection and not for others. How can 
items be selected to keep the error of measurement constant at different 
points on the scale? How can items be selected to make the error of 
measurement smallest at a prospective cutting score? Since the
variation of the error of measurement with test score depends on the third
and fourth moments (Mollenkopf, 1948, 1949), the theoretical analysis
of the item selection procedures offers some difficulties that have not 
yet been surmounted. However, since the error of measurement is a 
fundamental statistic for a test, it will be a distinct advance when item 
selection techniques can selectively control the error of measurement 
for different test scores. 

In this chapter the theoretical analysis of item analysis procedures 
has been presented only for the special case of the number right score. 
A corresponding analysis of the relationship between the item parameters 
and the test parameters is needed for other types of test scoring pro- 
cedures. For example, a different type of item analysis is appropriate 
if the score is on the basis of level reached as in the absolute scaling 
methods (Thurstone, 1925 or 1927b), or as in the scaling methods 
developed by Guttman or the latent structure methods of Lazarsfeld;
see Social Science Research Council (1950), Vol. IV.

[...] this field has been done by Gross[...]. The integration of
psychophysical [...] would be a major achievement.

[...] experimental or theoretical work has been done on the effect of
group changes on item parameters. If we assume that a given item
requires a certain ability (A), the proportion of a group answering
that item correctly will increase and decrease as the ability level of
the group changes. The amount of this change will be greater
for an item that is highly correlated with ability A than for one that 
correlates only moderately with ability A. If we have some standard 
measure of ability A, it may be that the ability level at which 50 per 
cent pass and 50 per cent fail would not be subject to as much fluctuation 
as the proportion of correct responses. As yet there has been no sys- 
tematic theoretical treatment of measures of item difficulty directed 
particularly toward determining the nature of their variation with 
respect to changes in group ability. Neither has the experimental work
on item analysis been directed toward determining the relative invariance 
of item parameters with systematic changes in the ability level of the 
group tested. 

A similar problem of invariance is encountered in considering measures
of the relationship between an item and the total test score or the
criterion score. For example, the reliability index presented in this
chapter involves the point biserial correlation. This coefficient varies
systematically with item difficulty (Carroll, 1945; Ferguson, 1941a;
Gulliksen, 1945) and consequently will vary with the ability level
of the group tested. Theoretically there is no such systematic bias in
biserial correlation. The biserial correlation should not change as the
item difficulty changes with variations in group ability level. However,
the data given by Richardson (1936a) showed systematic changes in
biserial correlation with changes in ability level of the group. It might
be found that some statistic related to the error of measurement or the
slope of the regression line would turn out to be relatively stable despite
changes in the mean and the standard deviation of ability in the group
tested. If such a statistic were developed and used, then in constructing
any test it would be necessary to have information on the ability range
to be tested in order to construct a suitable test from the items available.
As is true for item-difficulty parameters, we do not have the appropriate
theoretical and experimental investigations showing how different
item-test correlation measures vary with changes in the average and
standard deviation of ability of the group tested.

The discussion in the foregoing paragraph applies both to item-test
and item-criterion correlations. There is one additional factor affecting
item-test correlations that does not influence item-criterion correlations.
The length of the test of which the item is a part will affect the item-test
correlation but cannot influence the item-criterion correlation. For very
short tests (two or three items), the item score will form a considerable
fraction of the test score; hence the item-test correlation will at first
tend to decrease as items are added to the test. For tests longer than
fifty or a hundred items, this effect is negligible; and, as the test length
increases, a slight increase in item-test correlation could be expected
because of the decrease in the error component of the total test score as
test length is increased. Again the appropriate theoretical and
experimental investigations are lacking. It is probable, however, that some
conditions regarding a minimum number of items for a subset could be
found so that we might say that neither of these factors is serious as
long as we consider subsets of no less than, for instance, fifty items.

In addition to the problems of the relationship between item param- 
eters and test parameters, and the nature of the variation of item 
parameters with changes in other factors such as the length of the test 
and the ability of the group, we have the problem of the most efficient 
statistics to use in estimating these parameters. A complete treatment 
of this problem would include both statistical efficiency in the sense of 
reducing the sampling error of the statistic, and cost efficiency in the 
sense of reducing the labor and machine costs of computation. In com- 
paring different methods for an over-all determination of efficiency, it 
is necessary to adjust the number of cases for each method so as to 
equalize the sampling error, and then compare the costs of dealing with 
these appropriately adjusted numbers of cases. 


Problems 


1. Assume that published data give the biserial correlation between each item and 
the total test or the criterion score. Give the formula for changing biserial correlation 
into the reliability or the validity index discussed here. 


2. Show the relationship between the method of improving test validity presented
by Horst (1936b), “Item Selection by Means of a Maximizing Function”
(Psychometrika, —), and the method presented here.


3. Study the material in Guilford, Psychometric Methods, pages 434 and 435, on
Cook’s index B and Clark's index. Compare these two indices.


4. The following item analysis information is available on a 35-item test. Which
items should be eliminated in shortening the test to 30 items?
(These data were furnished through the courtesy of Dr. W. G. Mollenkopf
of the Educational Testing Service.)


                                       Point Biserial Correlation
          Proportion      Item               of Item with
  Item    Answering       Standard      Total Test   Criterion   Reliability   Validity
 Number   Item Correctly  Deviation       Score        Score        Index        Index

            p_g       s_g = sqrt(p_g - p_g^2)  r_xg        r_yg       r_xg s_g     r_yg s_g

    1       .800            .400           .280         .203         .112         .081
    2       .814            .389           .393         .152         .153         .059
    3       .731            .443           .142         .126         .063         .056
    4       .807            .395           .256         .327         .101         .129
    5       .241            .428           .409         .168         .175         .072
    6       .379            .485           .266         .188         .129         .091
    7    [illegible]        .451           .233         .186         .105         .084
    8    [illegible]        .420           .200         .200         .084         .084
    9       .634            .482           .203         .212         .098         .102
   10       .559            .497           .237         .213         .118         .106
   11       .641            .480           .375         .273         .180         .131
   12       .621            .485           .400         .291         .194         .141
   13       .241            .428           .285         .313         .122         .134
   14       .441            .497           .270         .245         .134         .122
   15       .324            .468           .385         .239         .180         .112
   16       .628            .483           .290         .157         .140         .076
   17       .138            .345           .287         .165         .099         .057
   18       .483            .500        [illegible]     .232         .207         .116
   19       .097            .296           .253         .267         .075         .079
   20       .455            .498           .301         .309         .150         .154
   21    [illegible]        .474           .441         .255         .209         .121
   22       .280            .449           .434         .165         .195         .074
   23       .667            .471           .193         .208         .091         .098
   24       .457            .498           .327         .207         .163         .103
   25       .435            .496           .278         .222         .138         .110
   26    [illegible]        .485           .227         .196         .110         .095
   27    [illegible]        .356           .213         .247         .076         .088
   28       .305            .460           .143         .191         .066         .088
   29    [illegible]        .467           .435         .330         .203         .154
   30       .545            .498           .143         .118         .071         .059
   31    [illegible]        .496           .381        -.008         .189        -.004
   32       .356            .479           .476         .194         .228         .093
   33       .713            .452           .257         .150         .116         .068
   34       .558            .497           .366         .215         .182         .107
   35       .308            .462           .266         .182         .123         .084
APPENDIX A


Equations from Algebra, Analytical Geometry,
and Statistics, Used in Test Theory


Elementary algebra assumed for test theory


Expansion of binomials:
$(x + y)^2 = x^2 + 2xy + y^2$
$(x - y)^2 = x^2 - 2xy + y^2$
$(x + y)^n = x^n + n x^{n-1} y + \dfrac{n(n-1)}{2} x^{n-2} y^2 + \cdots + \dfrac{n!}{r!(n-r)!} x^{n-r} y^r + \cdots + n x y^{n-1} + y^n$

Factorial notation:
$n! = n(n - 1)(n - 2) \cdots (3)(2)(1)$

Expansion of polynomials:
$(a + b + \cdots + y + z)^2 = a^2 + b^2 + \cdots + y^2 + z^2$
    $+ 2ab + \cdots + 2ay + 2az + \cdots$
    $+ 2by + 2bz + \cdots + 2yz$
$(a + b + \cdots + y + z)(A + B + \cdots + Y + Z)$
    $= aA + aB + \cdots + aY + aZ + bA + bB + \cdots + bY + bZ + \cdots$
    $+ yA + yB + \cdots + yY + yZ + zA + zB + \cdots + zY + zZ$
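Solution of simultaneous linear equations:


$ax + by = c$

$dx + ey = f$

$x = \dfrac{\begin{vmatrix} c & b \\ f & e \end{vmatrix}}{\begin{vmatrix} a & b \\ d & e \end{vmatrix}} = \dfrac{ce - bf}{ae - bd}$

$y = \dfrac{\begin{vmatrix} a & c \\ d & f \end{vmatrix}}{\begin{vmatrix} a & b \\ d & e \end{vmatrix}} = \dfrac{af - cd}{ae - bd}$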
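
The determinant solution may be written out as a few lines of Python.
This is a sketch for illustration; the function name `solve_2x2` is
not from the text, and the case of a vanishing denominator is simply
reported as an error.

```python
def solve_2x2(a, b, c, d, e, f):
    """Solve a*x + b*y = c and d*x + e*y = f by determinants."""
    det = a * e - b * d
    if det == 0:
        raise ValueError("no unique solution: ae - bd is zero")
    x = (c * e - b * f) / det
    y = (a * f - c * d) / det
    return x, y

# Example: 2x + y = 5 and x - y = 1.
x, y = solve_2x2(2, 1, 5, 1, -1, 1)
```

For the example shown the solution is x = 2 and y = 1, which may be
verified by substitution in both equations.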


The solution of a quadratic equation:


$ax^2 + bx + c = 0$
$x = \dfrac{-b \pm \sqrt{b^2 - 4ac}}{2a}$


Analytical geometry assumed for test theory


Equation of the straight line:


$y = ax + b$
where a is the slope of the line, and
b is the intercept on the y-axis.


Equation of a circle with its center at the origin and radius r:


$x^2 + y^2 = r^2$
General equation of a circle:
$(x - a)^2 + (y - b)^2 = r^2$
where r is the radius,
a is the abscissa value for the center, and
b is the ordinate for the center.


Equation of a hyperbola with asymptotes x = 0 and y = 0:


$xy = c$
Equation of hyperbola with asymptotes x = a and y = b:
$(x - a)(y - b) = c$
Equation of hyperbola with asymptotes Va yVb = 0: 


Equation of an ellipse: 


Equation of a parabola:
$y = ax^2 + bx + c$
The treatment of conics may be generalized as follows:
$A'x'^2 + B'x'y' + C'y'^2 + D'x' + E'y' + F' = 0$
(General equation of the second degree; represents any conic section.)


$x' = x \cos\phi - y \sin\phi$
$y' = x \sin\phi + y \cos\phi$
(Transformation used to change the general equation of the second
degree to a standard form.)

where $\tan 2\phi = \dfrac{B'}{A' - C'}$,
$\sin\phi = \sqrt{\dfrac{1 - \cos 2\phi}{2}}$, and
$\cos\phi = \sqrt{\dfrac{1 + \cos 2\phi}{2}}$.


By use of the foregoing transformation any equation of the second
degree can be rotated to the following standard form, such that the
coefficient of the xy term is zero.


$Ax^2 + Cy^2 + Dx + Ey + F = 0$
(Standard form for the general second-degree equation. Represents any
conic section with axes parallel to the coordinate axes.)

If A and C have unlike signs, this equation represents a hyperbola
with axes parallel to the coordinate axes.

If A and C have the same sign, this equation represents an ellipse
with axes parallel to the coordinate axes.

If A equals C, this equation represents a circle.

If A or C equals zero, this equation represents a parabola with its
axis parallel to a coordinate axis.

If A equals C equals zero, this equation represents a straight line.

Elementary statistics assumed in test theory

$X_i$ = gross score of individual i

N = number of persons in sample

$M_x = \dfrac{\sum_{i=1}^{N} X_i}{N}$    (mean)

$x_i = X_i - M_x$    (deviation score)

$\dfrac{\sum_{i=1}^{N} x_i}{N} = 0$    (mean deviation score)

$s_x^2 = \dfrac{\sum_{i=1}^{N} x_i^2}{N} = \dfrac{\sum_{i=1}^{N} X_i^2}{N} - M_x^2$    (variance of a specified sample)

$\dfrac{\sum_{i=1}^{N} x_i^2}{N - 1}$    (estimate of variance in universe from which sample is drawn)

See Yule and Kendall (1940), pages 434-436, for a discussion of the
use of N and N - 1 in the denominator of the formula for variance.

$\sum_{i=1}^{N} x_i^2 = \sum_{i=1}^{N} X_i^2 - N M_x^2$    (gross score formula for sum of squares of deviations from the mean)

$r_{xy} = \dfrac{\sum xy}{N s_x s_y} = \dfrac{\sum xy}{\sqrt{\sum x^2 \sum y^2}}$    (deviation score formula for correlation)

$r_{xy} = \dfrac{\sum XY - N M_x M_y}{\sqrt{(\sum X^2 - N M_x^2)(\sum Y^2 - N M_y^2)}}$    (gross score formula for correlation)

$r_{xy} s_x s_y = \dfrac{\sum xy}{N} = \dfrac{\sum XY - N M_x M_y}{N}$    (covariance)

$\hat{y} = r_{xy} \dfrac{s_y}{s_x} x$    (deviation score formula for regression of y on x, used to estimate y from x)

$\hat{Y} = r_{xy} \dfrac{s_y}{s_x} X + M_y - r_{xy} \dfrac{s_y}{s_x} M_x$    (gross score formula for regression of Y on X)

$\hat{x} = r_{xy} \dfrac{s_x}{s_y} y$    (deviation score formula for regression of x on y, used to estimate x from y)

$\hat{X} = r_{xy} \dfrac{s_x}{s_y} Y + M_x - r_{xy} \dfrac{s_x}{s_y} M_y$    (gross score formula for regression of X on Y)

$s_{y \cdot x} = s_y \sqrt{1 - r_{xy}^2}$    (error of estimate, error made in estimating y from x)

$s_{x \cdot y} = s_x \sqrt{1 - r_{xy}^2}$    (error of estimate, error made in estimating x from y)
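
The regression and error-of-estimate formulas may be illustrated with
a short Python sketch. The five pairs of scores are hypothetical, and
the function name `regression_y_on_x` is introduced only for the
example. Standard deviations use N in the denominator, in keeping with
the sample variance defined above.

```python
import math

def regression_y_on_x(xs, ys):
    """Gross score regression of Y on X and its error of estimate."""
    n = len(xs)
    m_x = sum(xs) / n
    m_y = sum(ys) / n
    s_x = math.sqrt(sum((x - m_x) ** 2 for x in xs) / n)
    s_y = math.sqrt(sum((y - m_y) ** 2 for y in ys) / n)
    r_xy = sum((x - m_x) * (y - m_y) for x, y in zip(xs, ys)) / (n * s_x * s_y)
    slope = r_xy * s_y / s_x
    intercept = m_y - slope * m_x
    error_of_estimate = s_y * math.sqrt(1 - r_xy ** 2)
    return slope, intercept, error_of_estimate

# Hypothetical paired scores.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
slope, intercept, s_yx = regression_y_on_x(xs, ys)

# The error of estimate equals the standard deviation of the residuals.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
residual_sd = math.sqrt(sum(e * e for e in residuals) / len(residuals))
```

For these scores the regression line is Y = .6X + 2.2, and the error of
estimate computed from the formula agrees with the standard deviation
of the residuals about that line.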


$r_{xy \cdot z} = \dfrac{r_{xy} - r_{xz} r_{yz}}{\sqrt{1 - r_{xz}^2}\,\sqrt{1 - r_{yz}^2}}$    (partial correlation, the correlation between x and y for a constant value of z)

$s_{x-y} = \sqrt{s_x^2 + s_y^2 - 2 r_{xy} s_x s_y}$    (standard deviation of a difference)

$s_{x+y} = \sqrt{s_x^2 + s_y^2 + 2 r_{xy} s_x s_y}$    (standard deviation of a sum)

$s_{x-y} = \sqrt{s_x^2 + s_y^2} = s_{x+y}$    (standard deviation of a sum or difference for the special case of zero correlation)
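
A brief numerical check of the formulas for the standard deviation of
a sum and of a difference is sketched below in Python. The scores are
hypothetical and the function names are invented for the example.

```python
import math

def sd(values):
    """Standard deviation with N in the denominator."""
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

def correlation(xs, ys):
    n = len(xs)
    m_x, m_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - m_x) * (y - m_y) for x, y in zip(xs, ys)) / n
    return cov / (sd(xs) * sd(ys))

def sd_of_sum_and_difference(s_x, s_y, r_xy):
    """Return (s of x + y, s of x - y) from s_x, s_y, and r_xy."""
    cross = 2 * r_xy * s_x * s_y
    return (math.sqrt(s_x ** 2 + s_y ** 2 + cross),
            math.sqrt(s_x ** 2 + s_y ** 2 - cross))

# Hypothetical paired scores.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
sd_sum, sd_diff = sd_of_sum_and_difference(sd(xs), sd(ys), correlation(xs, ys))

# Direct computation from the summed and differenced scores.
direct_sum = sd([x + y for x, y in zip(xs, ys)])
direct_diff = sd([x - y for x, y in zip(xs, ys)])
```

The values obtained from the formulas agree with those computed
directly from the sums and differences of the paired scores.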


$A_{ig}$ = 1 or 0    (score of individual i on item g)

$p_g = \dfrac{\sum_{i=1}^{N} A_{ig}}{N}$    (proportion of individuals answering item g correctly)

$q_g = 1 - p_g$    (proportion of individuals answering item g incorrectly)

$s_g^2 = p_g q_g = p_g - p_g^2$    (variance of item g)

$\bar{X}_p = \dfrac{\sum_{i=1}^{N} X_i A_{ig}}{N p_g}$    (average test score for persons answering item g correctly)

$\bar{X}_q = \dfrac{\sum_{i=1}^{N} X_i (1 - A_{ig})}{N q_g}$    (average test score for persons answering item g incorrectly)

$z_g$ = normal curve ordinate corresponding to the area indicated by $p_g$ or $q_g$

$N z_g = \dfrac{N}{\sigma_x \sqrt{2\pi}} e^{-x^2 / 2\sigma_x^2}$    (ordinate of normal curve)

where N = number of cases,
$\sigma_x$ = standard deviation of distribution,
$\pi$ = 3.1416, and
$e = 2.7183 = \lim_{n \to \infty} \left(1 + \dfrac{1}{n}\right)^n$

$r_{\mathrm{bis}\,xg} = \left(\dfrac{\bar{X}_p - \bar{X}_q}{s_x}\right) \left(\dfrac{p_g q_g}{z_g}\right) = \left(\dfrac{\bar{X}_p - M_x}{s_x}\right) \left(\dfrac{p_g}{z_g}\right)$    (equivalent formulas for biserial correlation of item g with score x)

$r_{\mathrm{pt\text{-}bis}\,xg} = \left(\dfrac{\bar{X}_p - \bar{X}_q}{s_x}\right) \sqrt{p_g q_g} = \left(\dfrac{\bar{X}_p - M_x}{s_x}\right) \sqrt{\dfrac{p_g}{q_g}}$    (equivalent formulas for point-biserial correlation of item g with score x)
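
The two coefficients may be computed side by side, as in the following
Python sketch. It is offered only as an illustration: the scores are
hypothetical, the function name `item_test_correlations` is invented
for the example, and the normal curve ordinate is taken from
`statistics.NormalDist` in the Python standard library. The item must
be passed by some but not all persons.

```python
import math
from statistics import NormalDist

def item_test_correlations(scores, item):
    """Return (point-biserial, biserial) correlation of an item with a score."""
    n = len(scores)
    m_x = sum(scores) / n
    s_x = math.sqrt(sum((v - m_x) ** 2 for v in scores) / n)
    passers = [v for v, a in zip(scores, item) if a == 1]
    failers = [v for v, a in zip(scores, item) if a == 0]
    p = len(passers) / n
    q = 1 - p
    mean_p = sum(passers) / len(passers)
    mean_q = sum(failers) / len(failers)
    # z: ordinate of the unit normal curve at the point dividing it
    # into areas p and q.
    unit_normal = NormalDist()
    z = unit_normal.pdf(unit_normal.inv_cdf(p))
    point_biserial = (mean_p - mean_q) / s_x * math.sqrt(p * q)
    biserial = (mean_p - mean_q) / s_x * (p * q / z)
    return point_biserial, biserial

# Hypothetical test scores and answers to one item.
scores = [1, 2, 3, 4]
item = [0, 0, 1, 1]
r_pt_bis, r_bis = item_test_correlations(scores, item)
```

In this deliberately extreme example the biserial coefficient exceeds
unity, a reminder that it rests on the assumption of a normally
distributed variable underlying the item, which four equally spaced
scores do not satisfy.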


Use of the summation sign

If k represents a constant, and x, y, z, and w represent variables, the major principles in the use of the summation sign are given in the following equations. Since all summations are over persons, the subscripts and limits are not given.

$\sum(x + y) = \sum x + \sum y$ (The sum of x + y is equal to summation x plus summation y.)

$\sum(x - y) = \sum x - \sum y$ (The sum of a set of differences is equal to the difference of the two sums.)

$\sum kx = k\sum x$ (The sum of a constant k times a variable x is equal to the constant multiplied by summation x.)

$\sum k = Nk$ (The sum of a constant term is equal to N times the constant term.)

The combination of these principles with elementary algebra is illustrated in the following equations.

$\sum k(x + y) = k\sum x + k\sum y$

$\sum(k + y)(z + w) = k\sum z + \sum yz + \sum yw + k\sum w$
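The summation principles can be checked numerically. The sketch below uses a hypothetical helper and small integer scores (so that the comparisons are exact) and evaluates both sides of each equation.

```python
def summation_rules_hold(x, y, z, w, k):
    """Compare both sides of each summation rule for equal-length integer lists."""
    n = len(x)
    return {
        "sum(x + y)": sum(a + b for a, b in zip(x, y)) == sum(x) + sum(y),
        "sum(x - y)": sum(a - b for a, b in zip(x, y)) == sum(x) - sum(y),
        "sum(kx)": sum(k * a for a in x) == k * sum(x),
        "sum(k)": sum(k for _ in range(n)) == n * k,
        "sum(k(x + y))": sum(k * (a + b) for a, b in zip(x, y)) == k * sum(x) + k * sum(y),
        "sum((k + y)(z + w))": sum((k + b) * (c + d) for b, c, d in zip(y, z, w))
        == k * sum(z) + sum(b * c for b, c in zip(y, z))
        + sum(b * d for b, d in zip(y, w)) + k * sum(w),
    }

# Hypothetical scores for four persons.
results = summation_rules_hold(x=[2, 5, 7, 1], y=[1, 4, 6, 3],
                               z=[3, 0, 2, 8], w=[5, 1, 4, 2], k=3)
```

Every entry of `results` is `True` for these data, as it would be for any other equal-length lists of numbers.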


The score matrix

$$\begin{matrix}
X_{11} & X_{12} & X_{13} & \cdots & X_{1K} \\
X_{21} & X_{22} & X_{23} & \cdots & X_{2K} \\
X_{31} & X_{32} & X_{33} & \cdots & X_{3K} \\
\vdots & \vdots & \vdots & & \vdots \\
X_{N1} & X_{N2} & X_{N3} & \cdots & X_{NK}
\end{matrix}$$

The foregoing matrix represents the scores of N persons on each of K tests. The first subscript designates the persons (from 1 to N); and the second subscript designates the tests (from 1 to K). The scores in any given column are the scores of all the persons on one test, and the scores in any row are the scores of one person on all the tests.

The general term in this matrix may be written

$X_{ig} \quad (i = 1 \cdots N;\ g = 1 \cdots K)$

$X_{ig}$ indicates the score of the ith person on the gth test. The notation in parentheses shows that i varies from 1 to N and g from 1 to K.

The mean of any particular test (g) is written as follows in the double subscript notation:

$$M_{\cdot g} = \frac{\sum_{i=1}^{N} X_{ig}}{N}$$

The period is used to indicate the position of the subscript over which we have summed. This is read: The mean of the gth test is equal to the summation of X sub i g for test g from i equals 1 to i equals N, divided by the number of persons.

Using this same double subscript notation, it is possible to express the average score on all the tests for any one person. This is written

$$M_{i \cdot} = \frac{\sum_{g=1}^{K} X_{ig}}{K}$$

which is read: The average score of the ith person equals the sum of the scores ($X_{ig}$) for that person, from g equals 1 to g equals K, divided by K (the number of tests). The period is used in place of subscript g since summation was over this subscript.

The average score of all persons on all the tests would be expressed with a double summation notation,

$$M_{\cdot\cdot} = \frac{\sum_{i=1}^{N}\sum_{g=1}^{K} X_{ig}}{NK}$$

This is read: M is defined as the summation of $X_{ig}$ with respect to g from g equals 1 to g equals K, summed with respect to i from i equals 1 to i equals N, divided by N times K.

Wherever we are dealing with several persons and tests, the double subscript notation is desirable to avoid ambiguity. If no ambiguity arises, it is permissible to omit the subscripts after X, and also to omit the limits above and below the summation sign.

It should be noted that the matrix of scores is not symmetric. The score of the second person on the third test is different from the score of the third person on the second test.

$(X_{23} \neq X_{32})$

On the other hand, the variance-covariance matrix or the intercorrelation matrix is symmetric. The correlation (or covariance) of test 2 with test 3 is identical with that of test 3 with test 2.

The correlation matrix

$$\begin{matrix}
s_1^2 & r_{12}s_1s_2 & r_{13}s_1s_3 & \cdots & r_{1K}s_1s_K \\
r_{12}s_1s_2 & s_2^2 & r_{23}s_2s_3 & \cdots & r_{2K}s_2s_K \\
r_{13}s_1s_3 & r_{23}s_2s_3 & s_3^2 & \cdots & r_{3K}s_3s_K \\
\vdots & \vdots & \vdots & & \vdots \\
r_{1K}s_1s_K & r_{2K}s_2s_K & r_{3K}s_3s_K & \cdots & s_K^2
\end{matrix}$$

The foregoing matrix shows the variances and covariances for a set of K tests or items. The variance of the sum of the tests (or items) is the sum of the terms in this variance-covariance matrix. This sum may be written in several different ways,

$$\sum_{g=1}^{K}\sum_{h=1}^{K} r_{gh}s_gs_h \qquad (r_{gg} = r_{hh} = 1)$$

In order to show explicitly the difference between the variances and covariances, we may write

$$\sum_{g=1}^{K} s_g^2 + \sum_{g=1}^{K}\sum_{h=1}^{K} r_{gh}s_gs_h \qquad (g \neq h)$$

Sometimes the second term is written without the two summation signs and the upper limit used to designate the number of terms as follows:

$$\sum_{g=1}^{K} s_g^2 + \sum_{g \neq h = 1}^{K^2 - K} r_{gh}s_gs_h$$

With this notation it is understood that, since the terms where g = h are omitted, there are $K^2 - K$ terms in the second summation. Since the terms above the principal diagonal are identical with those below it, this sum for a symmetric matrix is sometimes written

$$\sum_{g=1}^{K} s_g^2 + 2\sum_{g > h = 1}^{K-1} r_{gh}s_gs_h$$
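The statement that the variance of a sum of tests equals the sum of all the terms in the variance-covariance matrix can be verified on a small example. This is a sketch with hypothetical scores and function names; variances and covariances are computed with N in the denominator.

```python
def covariance_matrix(scores):
    """Variance-covariance matrix (N in the denominator) for N x K scores."""
    n, k = len(scores), len(scores[0])
    means = [sum(row[g] for row in scores) / n for g in range(k)]
    return [
        [sum((row[g] - means[g]) * (row[h] - means[h]) for row in scores) / n
         for h in range(k)]
        for g in range(k)
    ]

def variance_of_totals(scores):
    """Variance of each person's total score over the K tests."""
    totals = [sum(row) for row in scores]
    mean_total = sum(totals) / len(totals)
    return sum((t - mean_total) ** 2 for t in totals) / len(totals)

# Hypothetical scores of four persons on three tests.
scores = [[1, 2, 3], [2, 1, 5], [4, 3, 3], [5, 6, 9]]
cov = covariance_matrix(scores)
k = len(cov)
sum_of_all_terms = sum(sum(row) for row in cov)
diagonal = sum(cov[g][g] for g in range(k))                       # the K variances
above_diagonal = sum(cov[g][h] for g in range(k) for h in range(g + 1, k))
total_variance = variance_of_totals(scores)
```

Here `sum_of_all_terms`, `diagonal + 2 * above_diagonal`, and `total_variance` agree, which corresponds to the first and last forms of the sum given above.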
Table of Ordinates and Areas
of the Normal Curve

TABLE OF THE NORMAL CURVE


ORDINATES (z) AND CUMULATIVE AREA (A) OF THE RIGHT HALF OF THE NORMAL
CURVE OF DISTRIBUTION OF UNIT AREA

For cumulative of whole curve, read .5 + A for + x/σ. Ordinates are represented
in terms of the total area as unity.

x/σ    z         A            x/σ    z         A

0.00   0.39894   0.00000      0.50   0.35207   0.19146
0.01   0.39892   0.00399      0.51   0.35029   0.19497
0.02   0.39886   0.00798      0.52   0.34849   0.19847
0.03   0.39876   0.01197      0.53   0.34667   0.20194
0.04   0.39862   0.01595      0.54   0.34482   0.20540
0.05   0.39844   0.01994      0.55   0.34294   0.20884
0.06   0.39822   0.02392      0.56   0.34105   0.21226
0.07   0.39797   0.02790      0.57   0.33912   0.21566
0.08   0.39767   0.03188      0.58   0.33718   0.21904
0.09   0.39733   0.03586      0.59   0.33521   0.22240
0.10   0.39695   0.03983      0.60   0.33322   0.22575
0.11   0.39654   0.04380      0.61   0.33121   0.22907
0.12   0.39608   0.04776      0.62   0.32918   0.23237
0.13   0.39559   0.05172      0.63   0.32713   0.23565
0.14   0.39505   0.05567      0.64   0.32506   0.23891
0.15   0.39448   0.05962      0.65   0.32297   0.24215
0.16   0.39387   0.06356      0.66   0.32086   0.24537
0.17   0.39322   0.06749      0.67   0.31874   0.24857
0.18   0.39253   0.07142      0.68   0.31659   0.25175
0.19   0.39181   0.07535      0.69   0.31443   0.25490
0.20   0.39104   0.07926      0.70   0.31225   0.25804
0.21   0.39024   0.08317      0.71   0.31006   0.26115
0.22   0.38940   0.08706      0.72   0.30785   0.26424
0.23   0.38853   0.09095      0.73   0.30563   0.26730
0.24   0.38762   0.09483      0.74   0.30339   0.27035
0.25   0.38667   0.09871      0.75   0.30114   0.27337
0.26   0.38568   0.10257      0.76   0.29887   0.27637
0.27   0.38466   0.10642      0.77   0.29659   0.27935
0.28   0.38361   0.11026      0.78   0.29431   0.28230
0.29   0.38251   0.11409      0.79   0.29200   0.28524
0.30   0.38139   0.11791      0.80   0.28969   0.28814
0.31   0.38023   0.12172      0.81   0.28737   0.29103
0.32   0.37903   0.12552      0.82   0.28504   0.29389
0.33   0.37780   0.12930      0.83   0.28269   0.29673
0.34   0.37654   0.13307      0.84   0.28034   0.29955
0.35   0.37524   0.13683      0.85   0.27798   0.30234
0.36   0.37391   0.14058      0.86   0.27562   0.30511
0.37   0.37255   0.14431      0.87   0.27324   0.30785
0.38   0.37115   0.14803      0.88   0.27086   0.31057
0.39   0.36973   0.15173      0.89   0.26848   0.31327
0.40   0.36827   0.15542      0.90   0.26609   0.31594
0.41   0.36678   0.15910      0.91   0.26369   0.31859
0.42   0.36526   0.16276      0.92   0.26129   0.32121
0.43   0.36371   0.16640      0.93   0.25888   0.32381
0.44   0.36213   0.17003      0.94   0.25647   0.32639
0.45   0.36053   0.17364      0.95   0.25406   0.32894
0.46   0.35889   0.17724      0.96   0.25164   0.33147
0.47   0.35723   0.18082      0.97   0.24923   0.33398
0.48   0.35553   0.18439      0.98   0.24681   0.33646
0.49   0.35381   0.18793      0.99   0.24439   0.33891

Reprinted by permission from Business Statistics, by George R. Davies and Dale
Yoder, Second Edition, pages 582-585. New York: John Wiley and Sons, Inc.
TABLE OF THE NORMAL CURVE (Continued)

x/σ    z         A            x/σ    z         A

1.00   0.24197   0.34134      1.50   0.12952   0.43319
1.01   0.23955   0.34375      1.51   0.12758   0.43448
1.02   0.23713   0.34614      1.52   0.12566   0.43574
1.03   0.23471   0.34850      1.53   0.12376   0.43699
1.04   0.23230   0.35083      1.54   0.12188   0.43822
1.05   0.22988   0.35314      1.55   0.12001   0.43943
1.06   0.22747   0.35543      1.56   0.11816   0.44062
1.07   0.22506   0.35769      1.57   0.11632   0.44179
1.08   0.22265   0.35993      1.58   0.11450   0.44295
1.09   0.22025   0.36214      1.59   0.11270   0.44408
1.10   0.21785   0.36433      1.60   0.11092   0.44520
1.11   0.21546   0.36650      1.61   0.10915   0.44630
1.12   0.21307   0.36864      1.62   0.10741   0.44738
1.13   0.21069   0.37076      1.63   0.10567   0.44845
1.14   0.20831   0.37286      1.64   0.10396   0.44950
1.15   0.20594   0.37493      1.65   0.10226   0.45053
1.16   0.20357   0.37698      1.66   0.10059   0.45154
1.17   0.20121   0.37900      1.67   0.09893   0.45254
1.18   0.19886   0.38100      1.68   0.09728   0.45352
1.19   0.19652   0.38298      1.69   0.09566   0.45449
1.20   0.19419   0.38493      1.70   0.09405   0.45543
1.21   0.19186   0.38686      1.71   0.09246   0.45637
1.22   0.18954   0.38877      1.72   0.09089   0.45728
1.23   0.18724   0.39065      1.73   0.08933   0.45818
1.24   0.18494   0.39251      1.74   0.08780   0.45907
1.25   0.18265   0.39435      1.75   0.08628   0.45994
1.26   0.18037   0.39617      1.76   0.08478   0.46080
1.27   0.17810   0.39796      1.77   0.08329   0.46164
1.28   0.17585   0.39973      1.78   0.08183   0.46246
1.29   0.17360   0.40147      1.79   0.08038   0.46327
1.30   0.17137   0.40320      1.80   0.07895   0.46407
1.31   0.16915   0.40490      1.81   0.07754   0.46485
1.32   0.16694   0.40658      1.82   0.07614   0.46562
1.33   0.16474   0.40824      1.83   0.07477   0.46638
1.34   0.16256   0.40988      1.84   0.07341   0.46712
1.35   0.16038   0.41149      1.85   0.07206   0.46784
1.36   0.15822   0.41309      1.86   0.07074   0.46856
1.37   0.15608   0.41466      1.87   0.06943   0.46926
1.38   0.15395   0.41621      1.88   0.06814   0.46995
1.39   0.15183   0.41774      1.89   0.06687   0.47062
1.40   0.14973   0.41924      1.90   0.06562   0.47128
1.41   0.14764   0.42073      1.91   0.06438   0.47193
1.42   0.14556   0.42220      1.92   0.06316   0.47257
1.43   0.14350   0.42364      1.93   0.06195   0.47320
1.44   0.14146   0.42507      1.94   0.06077   0.47381
1.45   0.13943   0.42647      1.95   0.05959   0.47441
1.46   0.13742   0.42786      1.96   0.05844   0.47500
1.47   0.13542   0.42922      1.97   0.05730   0.47558
1.48   0.13344   0.43056      1.98   0.05618   0.47615
1.49   0.13147   0.43189      1.99   0.05508   0.47670
TABLE OF THE NORMAL CURVE (Continued)

x/σ    z         A            x/σ    z         A

2.00   0.05399   0.47725      2.50   0.01753   0.49379
2.01   0.05292   0.47778      2.51   0.01709   0.49396
2.02   0.05186   0.47831      2.52   0.01667
2.03   0.05082   0.47882      2.53   0.01625
2.04   0.04980   0.47932      2.54   0.01585
2.05   0.04879   0.47982      2.55   0.01545
2.06   0.04780   0.48030      2.56   0.01506
2.07   0.04682   0.48077      2.57   0.01468
2.08   0.04586   0.48124      2.58   0.01431
2.09   0.04491   0.48169      2.59   0.01394
2.10   0.04398   0.48214      2.60   0.01358
2.11   0.04307   0.48257      2.61   0.01323
2.12   0.04217   0.48300      2.62   0.01289
2.13   0.04128   0.48341      2.63   0.01256
2.14   0.04041   0.48382      2.64   0.01223
2.15   0.03955   0.48422      2.65   0.01191
2.16   0.03871   0.48461      2.66   0.01160
2.17   0.03788   0.48500      2.67   0.01130
2.18   0.03706   0.48537      2.68   0.01100
2.19   0.03626   0.48574      2.69   0.01071
2.20   0.03547   0.48610      2.70   0.01042
2.21   0.03470   0.48645      2.71   0.01014
2.22   0.03394   0.48679      2.72   0.00987
2.23   0.03319   0.48713      2.73   0.00961
2.24   0.03246   0.48745      2.74   0.00935
2.25   0.03174   0.48778      2.75   0.00909
2.26   0.03103   0.48809      2.76   0.00885
2.27   0.03034   0.48840      2.77   0.00861
2.28   0.02965   0.48870      2.78   0.00837
2.29   0.02898   0.48899      2.79   0.00814
2.30   0.02833   0.48928      2.80   0.00792
2.31   0.02768   0.48956      2.81   0.00770
2.32   0.02705   0.48983      2.82   0.00748
2.33   0.02643   0.49010      2.83   0.00727
2.34   0.02582   0.49036      2.84   0.00707
2.35   0.02522   0.49061      2.85   0.00687
2.36   0.02463   0.49086      2.86   0.00668
2.37   0.02406   0.49111      2.87   0.00649
2.38   0.02349   0.49134      2.88   0.00631
2.39   0.02294   0.49158      2.89   0.00613
2.40   0.02239   0.49180      2.90   0.00595
2.41   0.02186   0.49202      2.91   0.00578
2.42   0.02134   0.49224      2.92   0.00562
2.43   0.02083   0.49245      2.93   0.00545
2.44   0.02033   0.49266      2.94   0.00530
2.45   0.01984   0.49286      2.95   0.00514
2.46   0.01936   0.49305      2.96   0.00499
2.47   0.01889   0.49324      2.97   0.00485
2.48   0.01842   0.49343      2.98   0.00471
2.49   0.01797   0.49361      2.99   0.00457
TABLE OF THE NORMAL CURVE (Continued)

x/σ    z         A            x/σ    z         A

3.00             0.49865      3.50   0.00087   0.49977
3.01             0.49869      3.51   0.00084   0.49978
3.02             0.49874      3.52   0.00081   0.49978
3.03             0.49878      3.53   0.00079   0.49979
3.04             0.49882      3.54   0.00076   0.49980
3.05             0.49886      3.55   0.00073   0.49981
3.06             0.49889      3.56   0.00071   0.49981
3.07             0.49893      3.57   0.00068   0.49982
3.08             0.49897      3.58   0.00066   0.49983
3.09             0.49900      3.59   0.00063   0.49983
3.10             0.49903      3.60   0.00061   0.49984
3.11             0.49906      3.61   0.00059   0.49985
3.12             0.49910      3.62   0.00057   0.49985
3.13             0.49913      3.63   0.00055   0.49986
3.14             0.49916      3.64   0.00053   0.49986
3.15             0.49918      3.65   0.00051   0.49987
3.16             0.49921      3.66   0.00049   0.49987
3.17             0.49924      3.67   0.00047   0.49988
3.18             0.49926      3.68   0.00046   0.49988
3.19             0.49929      3.69   0.00044   0.49989
3.20             0.49931      3.70   0.00042   0.49989
3.21             0.49934      3.71   0.00041   0.49990
3.22             0.49936      3.72   0.00039   0.49990
3.23             0.49938      3.73   0.00038   0.49990
3.24             0.49940      3.74   0.00037   0.49991
3.25             0.49942      3.75   0.00035   0.49991
3.26             0.49944      3.76   0.00034   0.49992
3.27             0.49946      3.77   0.00033   0.49992
3.28             0.49948      3.78   0.00031   0.49992
3.29             0.49950      3.79   0.00030   0.49992
3.30             0.49952      3.80   0.00029   0.49993
3.31             0.49953      3.81   0.00028   0.49993
3.32             0.49955      3.82   0.00027   0.49993
3.33             0.49957      3.83   0.00026   0.49994
3.34             0.49958      3.84   0.00025   0.49994
3.35             0.49960      3.85   0.00024   0.49994
3.36             0.49961      3.86   0.00023   0.49994
3.37             0.49962      3.87   0.00022   0.49995
3.38             0.49964      3.88   0.00021   0.49995
3.39             0.49965      3.89   0.00020   0.49995
3.40             0.49966      3.90   0.00019   0.49995
3.41             0.49968      3.91   0.00018   0.49996
3.42             0.49969      3.92   0.00017   0.49996
3.43             0.49970      3.93   0.00016   0.49996
3.44             0.49971      3.94   0.00015   0.49996
3.45             0.49972      3.95   0.00014   0.49997
3.46             0.49973      3.96   0.00013   0.49997
3.47             0.49974      3.97   0.00012   0.49997
3.48             0.49975      3.98   0.00011   0.49997
3.49             0.49976      3.99   0.00011   0.49998

* For skipped x/σ items below, read values next preceding.
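Entries of the kind tabulated here can be reproduced with the Python standard library, which may be convenient for values of x/σ between the tabled steps. This is only a sketch; the function names are hypothetical, and the curve is taken to have unit area and unit standard deviation, as in the table.

```python
from statistics import NormalDist

_unit_normal = NormalDist(mu=0.0, sigma=1.0)

def ordinate(x_over_sigma):
    """Ordinate z of the unit normal curve at the given deviate."""
    return _unit_normal.pdf(x_over_sigma)

def area_from_mean(x_over_sigma):
    """Area A between the mean and the given positive deviate."""
    return _unit_normal.cdf(x_over_sigma) - 0.5

z_at_one = round(ordinate(1.00), 5)
a_at_one = round(area_from_mean(1.00), 5)
z_at_half = round(ordinate(0.50), 5)
a_at_half = round(area_from_mean(0.50), 5)
```

Rounded to five places, these agree with the tabled entries for x/σ = 1.00 and x/σ = 0.50.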


APPENDIX C

Sample Examination Questions in Statistics
for Use as a Review Examination at the Beginning
of the Course in Test Theory¹

The following two experiments were performed:

Experiment 1. The average of the men on a physical sciences test is 243.0. The average of the women is 226.5. The standard error of the difference is 5.0.

Experiment 2. The average of the men on an English test is 158.4. The average of the women is 182.4. The standard error of this difference is 16.0.

Mark the following statements according to this code:

1. Applicable to experiment 1
2. Applicable to experiment 2
0. Applicable to neither experiment 1 nor to experiment 2

____ It would be worth while repeating this experiment with twice as many cases in each group.

____ It would be worth while repeating this experiment with four times as many cases in each group.

____ Since chance variation will not explain the results of this experiment, it is plausible to assume that there is a sex difference in the ability involved in this test.

____ Since chance variation will explain the results of this experiment, I do not feel that it is worth while to investigate this problem any further.

____ Differences larger than those obtained in this experiment would occur only one time out of a thousand if the two groups differed only by chance.

____ There is only one chance out of a thousand that the difference between the two groups was due to the influence of chance.

____ The difference of means is significant.

____ The difference of means is not significant.

¹ If students are not required to memorize formulas, items such as these are suitable for "open-book" examinations.

An intelligence test and an arithmetic test are given to a group of 1000 students. The correlation coefficient, means, standard deviations, and parameters of both regression lines are computed. In rechecking the results it is found that the intelligence test has been scored accurately, but there were a number of errors in the scoring of the arithmetic test. Assume that these errors are completely random errors.

For each of the measures listed below write:

1. If the correct value is larger than the value already computed.
2. If the correct value is smaller than the value already computed.
3. If the two values are the same.

____ The standard error of the mean of the arithmetic scores.

____ The standard deviation of the distribution of intelligence scores.

____ The mean arithmetic test score.

____ The standard deviation of the errors made in predicting arithmetic test score from intelligence test score.

____ The coefficient of alienation.

____ The coefficient of correlation between intelligence and arithmetic.

____ The variance of the predicted arithmetic test scores (that is, predicted from the regression of arithmetic test score on intelligence test score).

____ The variance of the observed arithmetic test scores minus the variance of the predicted arithmetic test scores.

____ The ratio of the standard deviation of the predicted intelligence test scores (that is, predicted from the arithmetic score) to the standard deviation of the observed intelligence test scores.

____ The square of the alienation coefficient plus the square of the correlation coefficient.

____ The product of the alienation coefficient and the standard deviation of the distribution of observed scores.

____ The slope of the regression of arithmetic on intelligence.

____ The standard deviation of the arithmetic test.

____ The slope of the regression of intelligence on arithmetic.


Before each of the following items write the number of the one formula from the following list of six that is most directly connected with the problem to be solved. Be sure not to make any calculations, just indicate the one best formula in each case.

1. $M \pm 3s_M \quad \left(s_M = \dfrac{s}{\sqrt{N}}\right)$

2. $\dfrac{M_1 - M_2}{s_d} \quad \left(s_d = \sqrt{s_1^2/N_1 + s_2^2/N_2}\right)$

3. $\dfrac{M_1 - M_2}{s_d} \quad \left(s_d = \sqrt{(1/N)(s_1^2 + s_2^2 - 2r_{12}s_1s_2)}\right)$

4. $ks_y$

5. $\dfrac{\sum xy}{Ns_xs_y}$

6. $r\left(\dfrac{s_y}{s_x}\right)X + M_y - r\left(\dfrac{s_y}{s_x}\right)M_x$


____ How can I estimate the geometry score of a student from his performance in algebra?

____ How far wrong is one likely to be when using arm-length to estimate height?

____ A sample of 100 Wistar adult white rats has an average weight of 342.5 grams; the standard deviation of the distribution of weight is 9.3 grams. What are reasonable upper and lower limits for the average weight of all Wistar adult white rats?

____ Which of two aptitude tests would it be better to use for estimating grades in this college?

____ An experiment is performed using two persons (one brother and one sister) from each of a hundred families. An intelligence test is given to these two hundred persons. Do brothers score higher than their sisters?

____ An instructor has two classes. In one there are 150 students, and in the other there are 136 students. The same intelligence test is given to the entire group of 286 students. Is the average intelligence of one class clearly higher than that of the other?

____ I want to predict the speed with which a rat will learn maze B from its performance in maze A.

____ Are rats more active on days when they have thyroid extract than on control days when they do not get the extract?
Before each statement given below, put a circle around the number(s) indicating the assumption(s) which must be made if the statement is to be regarded as correct. Use the following code:

1. A zero point which is not arbitrary.
2. A constant unit of measurement.
3. The assumed mean is approximately equal to the true mean.
4. The cases are evenly distributed within the class interval.
5. The number of cases in a class interval varies inversely with the distance from the mean.
6. The two distributions have the same number of cases.
7. The two distributions have the same mean.
8. The statement is correct as it stands, no assumption being involved.

1 2 3 4 5 6 7 8   The differences between brothers can be measured by taking the differences of their test scores.

1 2 3 4 5 6 7 8   The differences between brothers can be measured by taking the ratios of their test scores.

1 2 3 4 5 6 7 8   The mean may be computed by grouping the data in class intervals.

1 2 3 4 5 6 7 8   The simplest method of calculating the mean is by using grouped data and an equivalent scale with an assumed mean and an arbitrary origin.

1 2 3 4 5 6 7 8   The standard deviation of a distribution may be calculated by using grouped data and an equivalent scale with an assumed mean and an arbitrary origin.

1 2 3 4 5 6 7 8   The mean may be calculated from the formula $M = \sum X/N$.

1 2 3 4 5 6 7 8   The median may be calculated from a frequency distribution plotted in class intervals of ten.

1 2 3 4 5 6 7 8   A class of students is divided into two sections for the purpose of administering a given test. The standard deviation of the total class can be calculated from the means, standard deviations, and number of cases for each section.

1 2 3 4 5 6 7 8   John is twice as intelligent as James.

1 2 3 4 5 6 7 8   If a class in geometry has been given three tests during the semester, the final ranking of the students can be determined by summing these three scores for each student.
All applicants for admission to the university are given an English examination and a scholastic aptitude test. In 1933 the standards for admission based entirely on scholastic aptitude scores were raised. As a result the size of the entering class decreased markedly.

Mark the following items:

1. If it will probably be larger in the freshman class of 1932.
2. If it will probably be larger in the freshman class of 1933.
3. If it will probably be about the same in both classes.
4. If not enough data are given, or if code numbers given above do not apply.

____ Mean score on the scholastic aptitude test.

____ Standard deviation of the English examination.

____ Pearson correlation coefficient between English and aptitude scores.

____ Variance of the English scores as estimated from the aptitude test scores.

____ The standard deviation of the errors of prediction of English from aptitude test scores.

____ Coefficient of alienation.

____ Slope of the line of regression of English on aptitude scores.

____ Slope of the line of regression of aptitude on English scores.

____ The ratio of the standard error of estimate (that is, error made in estimating aptitude scores from English scores) to the standard deviation of the aptitude scores.

____ Standard deviation of aptitude scores.

For each of the statements below write:

1. If it applies to the mode.
2. If it applies to the median.
3. If it applies to the mean.
4. If it applies to none of these terms.

____ The abscissa of the highest point on the frequency distribution.

____ The ordinate of the highest point on the cumulative frequency curve.

____ The x-value of the steepest part of the cumulative frequency curve.

____ The point halfway between the two extreme values of the distribution.

____ The score value so chosen that exactly 50 per cent of the scores are higher than it.

____ The measure which lends itself most readily to algebraic treatment.
Calculate the mean and standard deviation of the following distribution:

Score      Frequency
160-174    10
145-159    22
130-144    45
115-129    18
100-114    5

Present all your calculations in an orderly data sheet.
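A calculation of this kind, using grouped data with an arbitrary origin, can be sketched as follows. The function name is hypothetical, the class intervals are assumed to be of equal size, the arbitrary origin is placed (by choice) at the midpoint of the 130-144 interval, and the standard deviation is computed with N in the denominator.

```python
import math

def grouped_mean_and_sd(intervals, frequencies, origin_index):
    """Mean and standard deviation from grouped data with an arbitrary origin.

    intervals: (lower, upper) integer score limits of each class interval.
    origin_index: position of the interval whose midpoint serves as the origin.
    """
    width = intervals[0][1] - intervals[0][0] + 1            # size of class interval
    midpoints = [(low + high) / 2 for low, high in intervals]
    origin = midpoints[origin_index]
    # Deviations from the arbitrary origin, in class-interval units.
    steps = [round((m - origin) / width) for m in midpoints]
    n = sum(frequencies)
    mean_step = sum(f * d for f, d in zip(frequencies, steps)) / n
    mean_square_step = sum(f * d * d for f, d in zip(frequencies, steps)) / n
    mean = origin + width * mean_step
    sd = width * math.sqrt(mean_square_step - mean_step ** 2)
    return mean, sd

intervals = [(160, 174), (145, 159), (130, 144), (115, 129), (100, 114)]
frequencies = [10, 22, 45, 18, 5]
mean, sd = grouped_mean_and_sd(intervals, frequencies, origin_index=2)
```

The choice of origin affects only the size of the intermediate numbers; any other interval midpoint gives the same mean and standard deviation.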

Suppose that after you have computed a mean, mode, median, range, and standard deviation on the data shown in the above tabulation, you find that two scores that are tabulated as 146 in the distribution are erroneous and should be tabulated as 129. Do not calculate means or standard deviations; answer simply from the general trend of the data.

The value of the mean already computed is (the same as, larger than, smaller than) the correct value.

The value of the mode already computed is (the same as, larger than, smaller than) the correct value.

The value of the median already computed is (the same as, larger than, smaller than) the correct value.

The value of the standard deviation already computed is (the same as, larger than, smaller than) the correct value.

The value of the range already computed is (the same as, larger than, smaller than) the correct value.

Given originally a symmetrical distribution of 500 cases with a mean of 100 and a σ of 25.

Add to this a second symmetrical distribution of 100 cases with individual scores ranging from 110 to 126.

Before each of the measures listed below write:

1. If the measure is increased by adding the second distribution.
2. If the measure is decreased by adding the second distribution.
3. If the measure is not affected by adding the second distribution.
4. If it is impossible to tell what will happen from the information given.

____ Mean

____ Median

____ Range

____ Mode

____ Standard deviation

____ Average deviation
Below are shown curves for six distributions, some of which are cumulative frequency curves and others are column diagrams.

[Figure: curves for six distributions, lettered A to F]

Encircle the correct alternative, or alternatives, in each of the parentheses. Be sure to distinguish carefully between distribution, which is the general term, and column diagram and cumulative frequency curve, which refer to specific types of plots.

The column diagram with the smallest standard deviation ... (A, B, C, D, E, F, None)
The column diagram with the smallest mode ... (A, B, C, D, E, F, None)
The column diagram with the largest mean ... (A, B, C, D, E, F, None)
The column diagram with the smallest range ... (A, B, C, D, E, F, None)
The column diagram with the smallest N ... (A, B, C, D, E, F, None)
The cumulative frequency curve with the smallest standard deviation ... (A, B, C, D, E, F, None)
The cumulative frequency curve with the largest mean ... (A, B, C, D, E, F, None)
The cumulative frequency curve with the largest mode ... (A, B, C, D, E, F, None)
The distribution with the largest range ... (A, B, C, D, E, F, None)
The distribution with the smallest range ... (A, B, C, D, E, F, None)
The distribution with the largest median ... (A, B, C, D, E, F, None)
The distribution(s) which is (are) negatively skewed and unimodal ... (A, B, C, D, E, F, None)
The distribution(s) which is (are) positively skewed and unimodal ... (A, B, C, D, E, F, None)
The distribution(s) which is (are) bimodal ... (A, B, C, D, E, F, None)
Fill in the spaces in the following table:

The relationships between the three terms M, N, and ΣX are such that it is possible to find the value of one of them when the other two are given. Below are a series of such problems.

Fill in the blank spaces.


4 6 
35 140 
h 84 
8 104 
4 100 
7 9 
eS, 


If the mean of a given set of scores (X) is 12 and the number of cases is 15, find Σ(X − k), where k is 5.

Ans. Σ(X − k) = ____

If ΣX is 133 and ΣY is 95, the mean of the X-scores is 7 and the mean of the Y-scores is 5. What is the value of Σ(X − Y)?

Ans. Σ(X − Y) = ____

If 25 students took both a vocabulary test and an intelligence test, and the following averages were obtained:

Vocabulary test average = 56; intelligence test average = 51; then

The sum of all the scores in the vocabulary test is ____

The sum of all the scores in the intelligence test is ____

The sum of all the scores in both tests is ____

If each student is given a composite score, which is found by taking his vocabulary test score and adding to it the intelligence test score multiplied by 2, the average of this set of 25 composite scores will be ____

If each student is given a composite score, which is found by subtracting his intelligence test score from his vocabulary test score, the average of these composite scores will be ____

If a new vocabulary score is found by deducting 10 from each student's original vocabulary test score, the sum of these new scores will be ____

And the average of these new scores will be ____
In the following problems c and k represent constants; also M (the mean) and N (the number of cases) are constants. X and Y are variables. S indicates "the sum of." Simplify each of the following expressions.


zi N 
S MX = S (NY +c — XM +X’) = 
4 1 
N N 
S (kX + cY) = S(Y-o- 
1 1 
N N 
S (YN) = S(k+Y)= 
1 1 
N N 
S (2XM) = 7 (X + Y(X — Y) = 
1 
N 
N n E 
S (24?) = 5 (X + Y)¥] = 
1 
N 
N z bl 
X+Y)}= 
S (NM?) = A zl 
1 
N N 
S (X — YY + kX) = 
S (M? — kX) = ; ii XY + kX)] 
1 
N 


S (MX + Y! 4 cY) = 
1 


If U, V, W, and Z are variables and a, b, c, and d are constants, simplify the following expressions:

Σabd =

ΣVaWc =

Σ(U + c)(Z − d) =


In a positively skewed distribution:

The mode is generally (larger than, smaller than, the same as) the median.
The mode is generally (larger than, smaller than, the same as) the mean.
The mean is generally (larger than, smaller than, the same as) the median.
Each of the following cases shows a discrepancy between the mean as calculated from the formula M = ΣX/N and the mean calculated by grouped data and an arbitrary origin.

Before each of the following cases place the number of the one comment that best applies.

Comments:
1. There must be an error in computation or tabulation.
2. If the arbitrary origin were placed nearer to the mean of the distribution, the discrepancy between the two means would be partially corrected.
3. Such a discrepancy between the two means is reasonable.
4. There is either some error in computation or tabulation, or else poor judgment has been shown in choosing the limits for the class intervals.
5. Such a discrepancy between the two means is reasonable when one has such a large standard deviation.

                        Mean Calculated                              Standard
                        with Class                                   Deviation
                        Interval and       Class       Arbitrary     of Distri-
              ΣX/N      Arbitrary Origin   Interval    Origin        bution
____ Case A   38.38     43.28              10          75.0          15
____ Case B   38.38     37.90              5           37.5          10
____ Case C   38.38     36.71              3           38.0          12
____ Case D   156.93    151.72             10          150.0         50
____ Case E   592.41    598.38             50          525.0         175
____ Case F   417.36    405.92             25          300.0

APPENDIX D

Sample Examination Items in Test Theory¹

After each of the following statements, encircle the letter or letters of all the statements which apply. Use the following code.

O = a statement that could not reasonably be true.
T = a statement that is unconditionally true.
A = true if the mean error is assumed to be zero.
B = true if the correlation between errors and true score is zero.
C = true if the correlation between two sets of errors is zero.
D = true if the standard deviations of two sets of errors are the same.

The observed score is equal to the true score plus the error ... O T A B C D

Equivalent forms of a test will have the same standard deviation ... O T A B C D

The average true score is equal to the average observed score ... O T A B C D

The true variance is equal to the error variance plus the observed variance ... O T A B C D

The error variance is equal to the observed variance multiplied by the difference between unity and the reliability coefficient ... O T A B C D

The average error is equal to the sum of the errors divided by the number of errors ... O T A B C D

The true variance is equal to the reliability coefficient, multiplied by the observed variance ... O T A B C D

The correlation between true scores and observed scores is equal to the square of the reliability coefficient ... O T A B C D

The square root of the difference between unity and the reliability coefficient is equal to the correlation between observed scores and error scores ... O T A B C D

The observed variance less the true variance is equal to the error variance ... O T A B C D

¹ If students are not required to memorize formulas, items such as these are suitable for "open-book" examinations.
Miscellaneous formulas applicable to a single test:

r = reliability of the test.
r_r = reader reliability of the test.
e = the standard error of measurement.
σ = the true standard deviation.
s = the standard deviation of the test scores.
d = the standard deviation of the difference between scores on comparable halves of the test.
1. o/s f 5. VI -r 
2. e/s 6. o e 
3. o/s 7. r/r; 
4. Vr -e 


EET] 
For each of the following items write the number of the one or more formulas that are clearly indicated. Be sure to give all the answers that are correct.

____ The reliability coefficient.

____ The correlation between true scores and observed scores.

____ The correlation between errors and observed scores.

____ The correlation between comparable halves of a test.

____ The total test variance.

____ The content reliability of the test.

____ The true variance divided by the reliability coefficient.

____ The Spearman-Brown formula would need to be used on this quantity in order to get the reliability of the test.

____ This will decrease with an increase in the length of the test.

____ The index of reliability.


Given the following information from the manual on each of two standardized spelling tests:

          Mean    Standard Deviation    Reliability
Test A    100     20                    .81
Test B    200     40                    .95


Estimate the standard error of measurement of test A.


Estimate the standard error of measurement of test B.
Estimate the correlation between true scores and observed scores for test A.


Student X scores 100 in test A. What is the standard error of this score?
What is a reasonable upper limit for the true score of student X?
What is a reasonable lower limit for the true score of student X?


Student Y scores 300 in test B. What is the standard error of this 
score? 


What is a reasonable upper limit for the true score of student Y? 
What is a reasonable lower limit for the true score of student Y? 
What is the standard deviation of the true scores in test A? 

What is the standard deviation of the true scores in test B? 

What is the mean of the true scores in test A? 

What is the mean of the error scores in test B? 

Give the index of reliability for test B. 

What is the correlation between true scores and error scores for test A? 
What is the correlation between observed scores and errors for test B? 


Student Z receives a standard score of 1.5 in test A. What is his gross score?


What is the standard error of the standard score of 1.5? 


What is the probable upper limit for the true standard score of stu- 
dent Z? 


What is the probable lower limit for the true standard score of stu- 
dent Z? 


Test A is selected by you and given to your class with the following results: M = 100; $\sigma$ = 40. Comment on these results.


Test B is selected by you and given to your class with the following results: M = 250; $\sigma$ = 30. Comment on these results.
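The estimates asked for in the spelling-test items rest on three standard relations of test theory: the standard error of measurement is s√(1 − r), the standard deviation of true scores is s√r, and the index of reliability is √r. The following is a minimal sketch of that arithmetic, not a key to the items; the function names and the dictionary `manual` are illustrative assumptions, and the figures are those of the manual table.

```python
import math

def standard_error_of_measurement(s, r):
    """Standard error of measurement: e = s * sqrt(1 - r)."""
    return s * math.sqrt(1.0 - r)

def true_standard_deviation(s, r):
    """Standard deviation of true scores: s * sqrt(r)."""
    return s * math.sqrt(r)

def index_of_reliability(r):
    """Correlation between true and observed scores: sqrt(r)."""
    return math.sqrt(r)

# (mean, standard deviation, reliability) as given in the manual table
manual = {"A": (100, 20, 0.81), "B": (200, 40, 0.95)}

for name, (mean, s, r) in manual.items():
    e = standard_error_of_measurement(s, r)
    t = true_standard_deviation(s, r)
    print(f"Test {name}: e = {e:.2f}, true SD = {t:.2f}, "
          f"index of reliability = {index_of_reliability(r):.3f}")
```

Under these relations the two tests have nearly the same standard error of measurement (about 8.7 and 8.9 score points) even though their standard deviations differ by a factor of two.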
An arithmetic test is reported to have a standard error of measurement of 10.


Estimate the reliability of this test when it is given to a class with mean 100 and standard deviation 20.


Estimate the reliability of this test when it is given to a class with mean 200 and standard deviation 10.


Estimate the reliability of this test when it is given to a class with mean 150 and standard deviation 40.


Various standard error formulas, together with erroneous formulas:


$r_{11}$ = reliability coefficient of test 1.

$r_{12}$ = the correlation between tests 1 and 2, which are not necessarily different forms of the same test.

$\sigma_1$ = the standard deviation of the test scores for test 1.


4 1l. e1 VÀ — rg 6. c V2 V1-— nd 


2. aV 1 — ru? 7. V2 Vi — Tb» 


3. ei V1 — rj» 8. «v2 vi =r 


4, aV = rn? 9. aV ri V 


5. aV2 Vl-—rnu 


For each of the following items write the number of the one formula which is most clearly suggested.


By how much does this student's test score deviate from his true score?


—— What is the extent of the error I am likely to make in estimating college grades by a scholastic aptitude test?


Mr. A is using one form of the Otis test, Mr. B is using a parallel form of the same test. By how much are their scores for the same people likely to differ?


I have given ten different forms of this test to Mr. X. Can I estimate the standard deviation of this distribution of ten scores?


The formula for the standard error of measurement.
The formula for the standard error of substitution.


The formula for the standard error of estimate.


The standard deviation of the distribution of differences between scores on parallel forms of a test.


—— The smallest standard error in the group.


The standard deviation of the errors made in regarding the obtained score as the true score.


—— The error made in predicting true score from the fallible scores.


Indicate the type of scores to which each of the following statements refers by 
using the following code. Give the one best answer for each item. 


1. Raw score. 5. Absolute scores.

2. Standard score. 6. Mental age scores.

3. Percentile scores. 7. I.Q. scores.

4. Normalized scores. 8. None of the foregoing scores.


—— Gives a linear plot with chronological age for ages 2 to 10 years, if the average 
score for large groups is used. 


— — This distribution must be Gaussian. 

The frequency distribution of these scores is rectangular. 
These scores are linearly related to raw scores.

The origin of these scores is at zero ability. 


The unit of measure is in some testable sense constant at different points on 
the scale. 


In groups that are homogeneous with respect to attainment level, this score is
likely to be correlated negatively with chronological age.


There is a procedure for checking to see whether or not these scores may be 
applied to a given set of data. 


If the raw score distribution is Gaussian, the plot of these scores against stand- 
ard scores will be linear. 


The result of the plot of these scores against normalized scores is the integral 
of the normal probability curve. 


These scores are comparable from distribution to distribution, in the sense 
that different groups have the same mean and standard deviation. 


These scores assume that all differences in the frequency distributions for dif- 
ferent groups are due solely to the peculiarities of the test. 


These scores assume that after all rank order is the important thing to consider. 
A test of arithmetic ability is given to each of three classes. In class A the testing
conditions are excellent. In class B the testing conditions are about the same as in
class A, except that an oversight of the tester allows the class two minutes more
than given to class A to work the test. In class C the testing conditions are not uniform;
after the test was over it was found that about one-third of the class had misunderstood
the nature of the test and had answered the questions with a bias that
influenced the correctness of the responses. The following were the results obtained:


Average 
Number Average 
uni kal E 3 Ru 
Attempted j 
Class A 50 5 100 2.1 100 
Class B 55 6 120 1.9 124 
Class C 45 T 90 2.4 81 


No person in any of the three classes finished the test. There were practically no 
items skipped. 
For each of the questions below write the number of the item that best applies.
Use the following code:


1. Class A. 5. False.
2. Class B. 6. Can't tell from data given.
3. Class C. 7. Nonsense.
4. True.


Would the reliability of the test be greater as calculated on class B or on class A?


In which class would the test reliability be greatest? 


Would the test reliability be greater when calculated on class A or on class C? 


— — Which class was best on the ability tested? 


The reliability of the test calculated on the combined scores of classes A and C
would be greater than that calculated on the combined scores of classes A
and B.


—— The Kuder-Richardson formula for reliability can legitimately be applied to 
the results of class A. 


The Kuder-Richardson formula could legitimately be applied to class B. 


The Kuder-Richardson formula could legitimately be applied to class C. 


— — If an intelligence test were given to the three classes, which class would most 
likely have the lowest mean score? 


—— The correlation between the scores in class A and class B would probably be
higher than that for classes A and C.
Miscellaneous formulas:


$r_{11}$ = a reliability coefficient.

$r_{10}$ = validity coefficient.

$\sigma$ = the true standard deviation of the test.

$s$ = the standard deviation of the test scores.

$e$ = the standard error of measurement.

$n$ = the number of times a test is increased in length.
1 = subscript designating the test.

0 = subscript designating the criterion.


1 T10 4. on 
i Vru 5. eVn 
9, Mw. 6. ns*(nr +1 — r) 
7" rp 7. sVn Var +1 —r 
3 nm ^ on 


2 
Nr . en 
T1100 
For each of the following items write the number of the one formula that is most 
clearly suggested. 
—— The true correlation between test and criterion. 
—— The error variance of a test when it is increased in length. 
—— The standard deviation of the raw scores of the augmented test. 


If I quadruple the length of this test and give it again to the same group of
students, what will happen to the true variance?


— I should like to know how much this test would correlate with my criterion, if 
it were possible to measure the criterion with a reliability of unity. 


— — Can I estimate the correlation that would exist between college grades and 
intelligence, if it were not for the errors of measurement in both variables? 


___ This test has a standard deviation of 20 and a reliability of .80. What will the 
standard deviation probably be if I quadruple the length of the test? 


Give the numerical answer to the foregoing question here 


—— This aptitude test has a reliability coefficient of .81, a validity coefficient of
.64, a standard deviation of 30, and an error of measurement of 13+. I should
like to estimate the validity coefficient the test would have if it were made perfect,
and correlated with the same criterion scores as before.


Give the numerical answer to the foregoing question here 


—— Does the error of measurement of a test increase, diminish, or remain constant
as the test is increased in length?


— If I quadruple the length of this test would I expect any change in the standard 
deviation of the true scores? 
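Two of the relations these items turn on can be checked numerically: the standard deviation of a test made n times as long, s√n √(nr + 1 − r), and the validity a test would show if it were made perfectly reliable, r₁₀/√r₁₁. The sketch below assumes those two standard forms; the function names are illustrative and are not taken from the formula list.

```python
import math

def lengthened_sd(s, r, n):
    """SD of raw scores when a test is made n times as long:
    s * sqrt(n) * sqrt(n*r + 1 - r)."""
    return s * math.sqrt(n) * math.sqrt(n * r + 1.0 - r)

def validity_with_perfect_test(r_10, r_11):
    """Validity corrected for unreliability in the test only:
    r_10 / sqrt(r_11)."""
    return r_10 / math.sqrt(r_11)

# Standard deviation 20, reliability .80, length quadrupled
sd_quadrupled = lengthened_sd(20, 0.80, 4)

# Reliability .81, validity .64
corrected_validity = validity_with_perfect_test(0.64, 0.81)

print(round(sd_quadrupled, 2), round(corrected_validity, 2))
```

Note that the standard deviation and error of measurement quoted in the aptitude-test item are not needed for the correction; only the reliability and validity coefficients enter.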


Formulas showing the relationship between test length, heterogeneity, reliability,
and validity:


reliability coefficient for test of unit length. 

validity coefficient for test of unit length. 

reliability coefficient, when altered by increasing either the length of the 
test or the heterogeneity of the group. 

validity coefficient altered by increasing the length of the test. 

as either coefficients or subscripts indicate the length of the augmented test. 
standard deviation of scores of the test of unit length.

standard deviation of scores of the test when it is altered either by in- 
creasing the length of the test or the heterogeneity of the group. 
For each item write the number of the one formula that is most clearly suggested.
Give only one answer except where multiple answers are indicated.


—— This test has a reliability of .81. I should like to raise the reliability to .95.


—— In working with test X, Mr. B reports a reliability of .84 with mean 112, standard
deviation 35. Mr. C uses the same test and reports that he gets a reliability
of .95.


—— I have a vocabulary test of 300 items with mean 210, standard deviation 15,
and reliability .76. Can I estimate its correlation with another similar vocabulary
test of 500 items, with a reliability of .81, mean 351, and standard deviation 21?


—— These formulas depend upon the assumption that the standard error of measurement
of a test is invariant with respect to variations in the heterogeneity of
the group taking the test. (Multiple answer possible.)


—— This intelligence test has a reliability of .80, but its correlation with grades is
only .50. I wonder if I could make the test so long that its validity would
increase to .70.


—— A given college entrance examination has a reliability of .80, mean 190, and
standard deviation 15. The same examination is given next year, and it is
discovered that the average score is 180 and the standard deviation is 25.
—— On this vocabulary test, I find that the odd-even correlation is .81.


—— This 30-minute test has such a low validity that I wonder if its validity would
be appreciably changed by making it into a 2-hour test. There would be space 
in the testing program for a test as long as that. 


—— The 2-hour final examination that I have been giving for the last two years 
has a distressingly low reliability. Would it be worth while to consider giving 
a 6-hour final of the same type? 


—— This formula can be readily derived from the correction for attenuation. 
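Several of these items concern the effect of length on reliability, which is governed by the Spearman-Brown formula: a test of reliability r made n times as long has reliability nr / (1 + (n − 1)r). Solving for n gives the length needed to reach a target reliability. A minimal sketch follows; the function names are illustrative, and the formulas are the standard Spearman-Brown forms rather than entries from the numbered list the items refer to.

```python
def spearman_brown(r, n):
    """Reliability of a test made n times as long (general Spearman-Brown)."""
    return n * r / (1.0 + (n - 1.0) * r)

def length_factor_for(r, target):
    """Number of times the test must be lengthened to reach `target` reliability."""
    return target * (1.0 - r) / (r * (1.0 - target))

# Raising a reliability of .81 to .95
n_needed = length_factor_for(0.81, 0.95)

# Stepping up an odd-even correlation of .81 to full-test reliability
full_test = spearman_brown(0.81, 2)

print(round(n_needed, 2), round(full_test, 3))
```

The two functions are inverses of one another, so `spearman_brown(0.81, n_needed)` returns the target .95 up to rounding; this gives a quick check on either calculation.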


A test is given to two different groups with these results: 


           Mean    σ     N
Group A    100     20    200
Group B    150     10    400


Mark each of the following items:
A if it will be larger in group A. 
B if it will be larger in group B. 
S if it will be about the same in both groups. 
O if one cannot tell from the data given. 
N if the statement is nonsense. 


—— The reliability coefficient of the test.
—— The standard error of measurement of the test.
—— The average achievement level of the group.


—— The reliability coefficient of the test if it is made four times as long.


—— The correlation between the scores of group A and the scores of group B.


—— The ratio of the standard deviation of odd-item scores to the standard deviation
of even-item scores.


—— The standard deviation of the difference between odd-item scores and even-item
scores.


—— The slope of the line of regression of even-item scores on odd-item scores.
—— The true variance.


—— The correlation between true scores and error scores.
Given a test with a reliability coefficient ($r_{11}$) .84, length ($k$) 100, and standard
deviation ($\sigma$) 20.

Estimate each of the following, giving: 


(a) The general formula expressed in symbolic form. 
(b) The numerical answer for the particular case. 


1. Reliability coefficient if the test is altered to length N (N = 250). 


(a) Formula 


(b) Numerical answer 


2. Length necessary if one is content to use a test with a reliability of .72. 


(a) Formula 


(b) Numerical answer 


3. The error variance for length N (N = 250). 


(a) Formula 


(b) Numerical answer 


4. The true score variance for length N (N = 250), 


(a) Formula 


(b) Numerical answer 


5. The obtained standard deviation for length N (N = 250) 


(a) Formula 


(b) Numerical answer 




6. The correlation between the test of length k (100) and length N (250).


(a) Formula 


(b) Numerical answer 
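Items 1 and 3 to 5 can be tied together by one fact: when a test is made n times as long, the true variance is multiplied by n² and the error variance by n. The sketch below applies that decomposition to the test described above (reliability .84, length 100, standard deviation 20) with n = 250/100; the function name and the dictionary keys are illustrative assumptions.

```python
import math

def lengthen(s, r, n):
    """Variance components of a test made n times as long.

    True variance grows as n**2, error variance as n; the observed
    variance is their sum and the reliability is their ratio.
    """
    true_var = r * s ** 2 * n ** 2
    error_var = (1.0 - r) * s ** 2 * n
    observed_var = true_var + error_var
    return {
        "true_variance": true_var,
        "error_variance": error_var,
        "observed_sd": math.sqrt(observed_var),
        "reliability": true_var / observed_var,
    }

result = lengthen(s=20, r=0.84, n=250 / 100)
for key, value in result.items():
    print(f"{key}: {value:.4f}")
```

The reliability obtained this way agrees with the Spearman-Brown value nr / (1 + (n − 1)r), which is a useful consistency check on the separate answers.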


A class of 200 students is given the L and M forms of the Stanford Binet test. 

Below are given a number of different ways of estimating the reliability coefficient 
of the Stanford Binet. 

In column A mark each of these methods: 


1 if it is the best method of estimating reliability. 
+ if it is a reasonably good method of estimating reliability. 
0 if it is a method that could not give an estimate of reliability. 
In column B mark each of these methods: 
+ if it is necessary to use the Spearman-Brown correction. 
0 if it is not necessary to use this correction. 
A B 


Correlation of score on odd items with score on even items on form L. 


Correlation of score on the first half of the test, with score on the second 
half on form M. 


Correlation of score on form M with score on form L. 


Use of the Kuder-Richardson formula (simplest form using only mean 
standard deviation and number of items) on the items of form L. 


—— — Give form M again and correlate scores on the first giving with those 
on the second for form M. 


A and B are comparable halves of a test. $r_{AB}$ = .60. The standard deviation of
A is 14, and its mean is 103; corresponding figures for B are 26 and 106, respectively.
Comment on the foregoing data with special reference to the reliability of the total
test.


1 and 2 are comparable halves of a test. $r_{12}$ = .90. The mean and standard deviation
of part 1 are respectively 147 and 34; the corresponding figures for part 2 are 148
and 33. Comment on the foregoing data with special reference to the reliability of
the total test.
A test of some simple mechanical ability in which practice has no effect is given
twice to a class of 100 students. The standard deviation of each of the distributions
is 10.0, the correlation between the two scores is .64. Assume that the distribution
is normal, homoscedastic, etc.


1. What is the probability that the score of any given student on the first test will
deviate by more than 6 score points from his score on the second test?


(a) Appropriate
formula


(b) Numerical value 
of appropriate 
standard error 


(c) Probability 


2. What is the probability that the score of any given student on the first test will 
deviate by more than 6 score points from the prediction made from scores on the 
second test?


(a) Appropriate 
formula 


(b) Numerical value 
of appropriate 
standard error


(c) Probability 


3. What is the probability that the score of any given student on the first test will 
deviate by more than 6 score points from his true score? 


(a) Appropriate 
formula 


(b) Numerical value 
of appropriate 
standard error 


(c) Probability 
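The three questions differ only in which standard error applies. The sketch below assumes the usual reading: the standard error of substitution, s√2 √(1 − r), for the difference between two administrations; the standard error of estimate, s√(1 − r²), for the deviation from a regression prediction; and the standard error of measurement, s√(1 − r), for the deviation from the true score. That pairing, and the function names, are assumptions made for illustration. The probability is the two-sided tail of a normal distribution.

```python
import math

def two_sided_tail(limit, se):
    """P(|deviation| > limit) for a normal deviation with mean 0 and SD se."""
    z = limit / se
    return 1.0 - math.erf(z / math.sqrt(2.0))

s, r, limit = 10.0, 0.64, 6.0

se_substitution = s * math.sqrt(2.0) * math.sqrt(1.0 - r)  # score vs. score
se_estimate = s * math.sqrt(1.0 - r ** 2)                  # score vs. prediction
se_measurement = s * math.sqrt(1.0 - r)                    # score vs. true score

for label, se in [("substitution", se_substitution),
                  ("estimate", se_estimate),
                  ("measurement", se_measurement)]:
    print(f"{label}: standard error = {se:.2f}, "
          f"probability = {two_sided_tail(limit, se):.3f}")
```

Because the standard error of measurement here equals the 6-point limit exactly, the third probability is simply the chance of falling more than one standard deviation from the mean.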


Answers to Problems 
Note: Where discussions or derivations are called for, the answers are not given.


Chapter 2 


1.
Test    Index of       Standard Deviation    Correlation between    Error of
        Reliability    of True Scores        Observed and Error     Measurement
                                             Scores
A       .95            14.25                 .30                    4.50
B       .92            23.64                 .40                    10.28
C       .88            9.94                  .47                    5.31
D       .93            71.14                 .36                    27.54
E       .87            19.05                 .49                    10.73
2.
                       True Score Limits
                       (Approximately 0.3 per cent level)
(a) 115 on test A      101.50-128.50
(b) 211 on test B      180.16-241.84
(c) 31 on test C       15.07-46.93
(d) 500 on test D      417.38-582.62
(e) 100 on test E      67.81-132.19
3.
        Minimum Difference
Test    C = 2            C = 3
A       (12.726) 13      (19.089) 20
B       (29.072) 30      (43.608) 44
C       (15.017) 16      (22.525) 23
D       (77.883) 78      (116.825) 117
E       (30.344) 31      (45.517) 46
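The tabulated answers to problems 2 and 3 are consistent with two simple rules applied to the errors of measurement in table 1: true score limits of three standard errors on either side of the observed score, and a minimum difference of C√2 times the error of measurement, rounded up to the next whole score point. The sketch below reproduces the rows for tests A and B under those rules; the rules are inferred from the printed figures, and the function names are illustrative.

```python
import math

def true_score_limits(x, e, k=3.0):
    """Limits k standard errors of measurement either side of score x."""
    return x - k * e, x + k * e

def minimum_difference(e, c):
    """Smallest difference between two scores that is c standard errors
    of a difference apart: c * sqrt(2) * e."""
    return c * math.sqrt(2.0) * e

# Errors of measurement for tests A and B from table 1
errors = {"A": 4.50, "B": 10.28}

print(true_score_limits(115, errors["A"]))
print(true_score_limits(211, errors["B"]))
for name, e in errors.items():
    for c in (2, 3):
        d = minimum_difference(e, c)
        print(name, c, round(d, 3), math.ceil(d))
```

The unrounded differences differ from the printed ones in the third decimal place, presumably because the printed figures use a rounded value of √2; the whole-number answers are unaffected.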
Chapter 3
1. A.T-X-—E D.rgg,-0 
B. Mg =0 E. Xi = X.=--- = X, 
C. rp, = 0 F. sı = s2 = +++ = % 
Chapter 4 
1. Sa Sa Se Se 
A. 5.66 5.43 4.00 3.67 
B. 3.74 3.47 2.65 2.24 
C. 20.84 20.47 14.74 14.21 
D. 12.10 11.76 8.56 8.07 
E. 10.08 9.51 7.13 6.30 
12. (a) $s_e$ = 6.30 (c) 135.9 > $T_A$ > 98.1
(b) $s_e$ = 6.30 (d) 113.9 > $T_B$ > 76.1
Chapter 6 
3. Test    Estimated Reliability
A          .98
B          .84
C          .93
D          .95
E          .91
Chapter 7 
3. (a) 619.35; (b) 198.69; (c) 3.18; (d) 5.50; (e) 31 items; (f) 300 items; (g) 27 items
Chapter 8 
3. $\overline{s^2} = s_1^2$

$\overline{r_{ij} s_i s_j} = r_{11} s_1^2$

where $s_1^2$ is the variance of one unit test,
$\overline{s^2}$ is the average variance of all unit tests,
$r_{11}$ is the reliability of one unit test, and
$\overline{r_{ij} s_i s_j}$ is the average covariance of all unit tests.


4. (a) .97; (b) .96; (c) 3.89 times as long or 78 items; (d) 50 items; (e) 32 items;
(f) .88; (g) 240 items.
Chapter 9


1. (a) rir run and (b) res = nr, 
TII 
2. Assumed that (ras Vrn) is not greater than Vin 
4. .90.
5. .64.
6.
(a) Mean 55, standard deviation 13.15, reliability .90, validity .76.
(b) 34 new items, .54 new validity.
(c) Test E.
(d) Test A.
(e) Test A.
(f)   A      B      C      D      E
      .96    .68    .87    .89    .92
(g) .77.
(h) k = 3.86 or 4.
(i) True variance of test C = 100.75 (true variance of test C increased
to 300 items = 906.75).
Error variance of test C = 13.74 (error variance of test C increased
to 300 items = 41.22).
(j) Reliability of lengthened test = .97.
Reliability of lengthened criterion = .82.
Validity of lengthened measures = .79.
(k) .77. 
T. Test X items. 
Chapter 10 


2. (a) Reliability about .93. (b) Standard deviation about 6.2. (c) Yes. (d) No.
(e) Time limit decreased. (f) Test E is unsuitable for sectioning a group with standard
deviation of 3.9.


Chapter 11 


5. (a) .84; (b) 26.9. 6. (a) .90; (b) 42.7. 7. (a) .60; (b) 320.95. 9. 13.3 
10. (a).57; (b) 13.3. 


Chapter 12 
3. Ryz =.80. Rxz = .77. 


Chapter 14 


Note. To facilitate computational checks, quantities such as D, B, s? (or w), 
s?r (or w), and v as well as L, and —N logio L are given. 
In answering the question "Are the tests parallel?" the following convention was 


used: 
Yes indicates p-value greater than .05. 
No indicates p-value less than .01. 
? indicates p-value between .05 and .01. 
indicates that the test for equality of means cannot be made because the 
data are not in agreement with Hve (or Dye). 
Chapter 15
; (©) 845 (c) 84. 
5 92; (b) 89. 5. .93. 6. (a) 91; 0) 84; 
err ree ry 84; (1) 5.04; (e) (35.9-66.1), (67.9-88.1), 
FA » = = 99; (9) about 484 items; (i) 48; (j) 97. 
(f) .93, .96, .98, .99; 
Chapter 16 
i. (89), (85). 2. (81), (74). 3. (o) (91); (b) (89), 


Chapter 17 
T ed (3200) 4- 1.544 — (1.544)? = 3.46 
b d 500 
2. Frequencies 372 21 13 11 159910551) 6 29 Boo G 4 : 
Score 0128456 789101112 13 14 15 16 17 18 16 
My, = 1.54 
Su = 3.46 
3. 97 >R> .g2 
97 > R’ ».80 
4. Lower Bound for Reliability Coefficient
A    .96                .96
B    .00                .71
C    .62                .76
D    .83                .85
E    −3.5 or zero       .59
5. Error of Measurement, Upper Bound
A    4.75    5.19
B    4.31    8.74
C    3.24    7.09
D    3.25    3.29
E    4.75    7.19
Chapter 19 
à. "um DAM Cs Sunn 
"xeceding z + 26 
(a) 10 5 15 
(b) 50 25 61 
6 4 35 8 
(220 ig 29 
(e) 1 0.9 
9. 


10. (a) wi = 7.353A; + 138.97 
(b) w'; = 7.132B; + 88.21 

3. (a) Mean = 111.5039; standard deviation = 22.1712 

(b) wi = 1.26584; — 74.174 

(c) w'; = 1.036B; — 67.419 


Chapter 20 


i. O17; 2i 100. 8. = 


= 386X, —.342X, + .905N, + 315.550. 4. 
5. (a)and(s). 6. X, = 


48. 
= .2Xa + .9X,+ 97. 7. 


-721. 17. (a) .57; (0).63; (c) .69. 


Chapter 21 
4. Delete items 2, 5, 22, 31, 32. 
error variance effect on, 114

essay examinations, 211-214, 216 

experimental work showing effect of 
length on, 65-67 

formula for content, 214, 216 

formula for weighting to maximize, 
347, 357 

function of item difficulty and test
variance, 221-224, 226 

function of test mean and variance, 
225-227 

function of the variance of a difference, 
199 

function of variance of half-test scores, 
199 

graphs illustrating effect of length on, 
79-82 

heterogeneity of group, effect on, 110- 
114, 124 

index of, 22-23, 27, 32-33, 37 

instability of a trait, 197 

invariant function of (with changes in 
group variability), 140-141, 143

invariant function of (with changes in 
test length), 83-85, 86, 94, 101, 
105 

item homogeneity measures compared 
with, 220 

item parameters related to, 378-380, 
389 

judgment of test constructor measured
by, 220 

Kuder-Richardson formulas for, 223- 
224, 295-296 

length of test effect (general case), 77- 
79, 86 


Reliability, length of test necessary for 
specified, 82-83, 86 
matched random subtests, 207-210, 
215 
odd-even, affected by time limits, 236- 
238, 242 
odd versus even items, 205-207, 215- 
216 
parallel tests used for computing, 194-197, 214-215
parallel thirds, 207-209, 216 
reader, 211-213, 216 
selection effects on, 110-114, 124 
several subtests recommended, 201 
single common factor related to, 220- 
221 
Spearman-Brown formula, 63, 67, 78, 
86 
speeded tests, 201-203, 205-207, 215 
speeded test, lower bound for, 236-238, 
242 
split-half formulas, 198-201, 216
split-half formulas compared, 200-201, 
216 
split-half methods, 198-210, 215 
statistical criteria for equality in paral- 
lel tests, 175, 177, 185, 186-187 
successive halves, 201-205 
test-retest, 197-198, 215 
testing conditions effect on, 108-110 
time limits as affecting, 201-203, 205- 
207, 215 
true variance effect on, 110-114 
used in weighting formula, 331-334, 
356 
weighting to maximize, 346-348, 349- 
350, 357 
Reliability index, item selection affects, 
379-380, 384 
of an item, 377, 378, 379, 382, 383, 384, 
385, 387, 388, 389, 390 
Reliability of difference scores, 352-354, 
359 
computing diagram for, 354 
Restriction of range (see Selection of 
group) 


Score matrix, definition of, 427 
sum of terms in, 427-428 
Score transformations, purposes of, 266-268
types of, 267 
Scores, absolute, 283-284, 284-286, 306 
arbitrary linear transformations of, 
265-266, 304 
chance, mean of, 263, 304 
variance of, 263, 304 
criterion predicted from, 291-293, 
306 
critical points and error of measure- 
ment, 264-265, 304 
error, 4-6, 25, 28, 36 
graphic transformation to arbitrary 
scale, 265-266 
gross or raw, 424 
linear derived, 272-276, 305 
computing procedures, 274-275 
definition of, 272-274 
properties of, 276 
various types of, 272-273 
McCall's T-score, 282-283, 306 
non-chance, 263-265, 304 
normalized, 280-282, 305 
computing procedures, 282 
definition of, 280-281 
properties of, 280-282 
per cent of perfect, 264-265 
percentile, 276-280, 305 
computing procedures, 277-279 
correlation of, 280 
definition of, 276-277 
properties of, 279-280 
Scaled Scores of Cooperative Tests, 
283-284, 306 
selection effect on cutting, 292 
standard, 268-272, 305 
computing procedures, 270-272 
definition of, 268 
properties of, 272 
time, 252-255, 261 
true, 4-6, 25, 28, 36 
Scoring formula, function of number cor- 
rect and errors, 249, 260 
function of number correct and num- 
ber blank, 248, 260 
function of number correct and num- 
ber unattempted, 250, 260 
maximize item-criterion correlation, 
257, 261 


Scoring formula, mean criterion differ- 
ences, 256-257, 261 
multiple correlation for, 255-256 
number correct, 246, 259 
"rank-order" items, 258-259, 261 
t-test, used in, 256-257 
time and error scores, 253-254, 261 
weighting rights and errors, 255-256 
Selection of group, computing diagram, 
for explicit, 137 
for incidental, 134 
for relative effect on variance of ex- 
plicit and incidental, 139 
correlation between incidental and ex- 
plicit selection variables, variances 
known for explicit, 137-138, 142, 
148 
variances known for incidental, 133, 
142, 151-152, 156 
correlation between two incidental se- 
lection variables, variances known 
for explicit, 149-150, 156 
variances known for incidental, 153,
156
effects of, illustrated, 109, 128-130, 
135-136, 145-146 
explicit, definition of, 130-131, 141 
effects of, 135-138, 148-150
formulas for, 136-137, 142, 148-150, 
156 
multivariate case, 165-166, 170 
incidental, definition of, 130-131, 141 
effects of, 132-135, 150-155 
formulas for, 133, 142, 150-153, 156
multivariate case, 166-170, 171
invariant function of reliability and
validity, for explicit, 141, 143 
for incidental, 140, 143 
item difficulty parameters related to,
367-371, 392-393 
multivariate, basic assumptions for, 
162, 170 
basic definitions for, 158-159, 161 
effect on correlation, 165-170, 171 
effect on variance, 166, 168, 169, 
171 
practical importance of corrections for, 
145-146 
relative effect on variance of explicit 
and incidental, 138-140, 142
Selection of group, reliability affected by,
110-114, 124 
univariate, basic assumptions for three- 
variable case, 146-148, 155 
basic assumptions for two-variable 
case, 131-132, 141-142 
effect on correlation between inci- 
dental and explicit selection vari- 
ables, 133, 137-138, 142, 148, 151- 
152, 156 
effect on correlation between inci- 
dental selection variables, 149- 
150, 153, 156 
effect on variance (standard devia- 
tion), 110, 124, 135, 138, 142, 148, 
151, 156 
variance (standard deviation) of ex- 
plicit selection variable, a function 
of incidental selection variance, 
135, 142, 151 
variance (standard deviation) of inci- 
dental selection variable, a func- 
tion of explicit selection variance, 
138, 142, 148 
a function of incidental selection 
variance of a second variable, 151-152, 156
Selection of items, problems in theory of, 
391-394 
reliability index affected by, 379-380, 
384 
theory summarized, 388-391 
validity index unaffected by, 383, 384 
Skewness, effect on error of measurement, 
123 
equating of tests affected by, 296, 307 
Social Science Research Council, 392, 414
Spearman-Brown formula, experimental 
work on, 65-67 
for double length, 63, 67 
general case, 77-79, 86 
graphs illustrating, 79-82 
Spearman's correction for attenuation, 
101-104, 105 
Speeded test, correction of score for wrong
answers, 251-252, 260
definition of, 230-233, 241 
effect of guessing on score, 246-251 
error of measurement for, 233-236, 242 
item parameters affected by, 385-386 


Speeded test, odd-even reliability for, 
236-238, 242 
variance change in, 231-233, 241-242 
Standard deviation (see also Variance) 
Standard deviation of a difference, 40, 45, 
199, 216, 425 
Standard deviation of error scores, basic 
formulas, 15-17, 26, 33-34, 37 
effect of doubling length on, 61-62, 67 
effect of length on (general case), 72-73 
Standard deviation of errors of estimate, 
425 
Standard deviation of a sum, 70, 76, 425 
Standard deviation of test, effect of dou- 
bling length on, 60-61, 67 
effect of length on (general case), 69- 
71, 73 
effect of time limits on, 232-233, 241- 
242 
effect on reliability of changes in, 110- 
114, 124 
formulas showing effect of selection on, 
110, 124, 135, 138, 142, 148, 151, 
156 
item parameters related to, 375-378, 
389 
weighting scores by reciprocal of, 334- 
336, 356 
Standard deviation of true scores, basic 
formulas, 8-11, 14-15, 26, 30-32, 
34, 37 
effect of doubling length on, 61, 62, 67 
effect of length on (general case), 71, 
73 
Standardizing, error of measurement, ef- 
fect on norms, 289-290 
kurtosis, effect on norms, 296, 307 
multiple trials, effect on per cent pass- 
ing, 265, 304, 313 
per cent of perfect score used for, 264- 
265 
regression line used for, 291-296, 297- 
298, 299-304, 306, 307 
regression line used influences norms, 
287-288, 306 
reliability, effect on norms, 289-290, 
306 
skewness, effect on norms, 296, 307 
successive hurdles, effect on per cent 
passing, 265, 304, 313 
Statistics, sample examination items,
437-446
Successive hurdles, effect on standards, 
265, 304, 313 
Summation, notation for correlation ma- 
trix, 429 
notation for score matrix, 427 
of terms in correlation matrix, 429 
of terms in score matrix, 427-428 
Summation sign, rules for, 426-427
use of, 426-429 
Sum, correlation of fixed test with a, 88- 
89 
Sums, correlation of two, 74-77, 85 
correlation of weighted, 316-319, 355 
standard deviation of, 70, 76, 425 
variance of, 75-76 


Test theory, sample examination items,
447-460
Time limits, error of measurement affected by, 233-236, 242
reliability affected by, 236-238, 242 
variance (standard deviation) affected 
by, 232-233, 241-242 
True scores, correlation between, 101-104, 105
correlation with error scores, 7, 26, 35-36, 37
correlation with observed scores, 22- 
23, 27, 32-33, 37 
defined as a limit, 28, 36 
defined as remainder, 5, 8, 25 
estimation of differences, 20-22 
estimation of limits, 17-20 
general considerations, 4-6, 25 
mean of, 8, 26, 29, 37 
standard deviation of, effect on relia- 
bility, 110-114 
equation for, 14-15, 26, 30-32, 37 
illustrations of changes in, 108-109
relation to error of measurement, 8- 
11, 26, 34, 37 
used in definition of parallel tests, 11, 
26 
variance, effect on reliability, 110-114 
equation for, 14-15, 26, 30-32, 37 
illustrations of changes in, 108-109 
relation to error variance, 8-11, 26, 
34, 37 


Unattempted items, mean and variance 
given by item analysis, 238-240, 
242-243 

odd-even reliability affected by, 236- 
238, 242 
variance affected by, 231-233, 241-242 

United States Army, 267, 273 

United States Army Air Forces, 207, 418 

United States Navy, 209, 265, 267, 273, 
304 

United States War Department, 418 

University of Chicago, The, 2, 40(n), 
204, 218(p), 229(p), 272, 286, 418 


Validity, effect of explicit selection on, 
137-138, 142, 148-150, 156 
effect of incidental selection on, 133, 
142, 151-153, 156 
effect of infinite length on, 95-98, 101- 
104, 105 
effect of multivariate selection on, 158- 
171 
effect of test length on (general for- 
mula), 88-90, 98-101, 104 
effect of univariate selection on, 145- 
156 
for true scores, 95-98, 101-104, 105 
function of, invariant with length of 
test, 94, 105 
invariant with selection of group, 
140-141, 143 
graphs showing effect of infinite length 
on, 96, 103 
graphs showing effect of test length 
(general case), 91, 92, 100 
illustrations of effect of selection procedures 
on, 128-130, 135-136, 
145-146 
item difficulty related to, 374-375 
item parameters related to, 380-385 
length necessary to attain a specified, 
90, 91, 93-94, 104 
maximized by graphic method, 382- 
384 
of early tests, 1 
statistical criteria for equality in paral- 
lel tests, 185, 186-187 
Validity index of an item, computing 
formula for, 387-388, 390 
definition, 382, 389 
invariant with item selection, 383, 384 
relation to test validity, 380-385, 389 
Variability, quotidian and reliability, 197 
Variability of group, effect of (see Selec- 
tion of group) 
Variance (see also Standard deviation) 
Variance due to interaction between per- 
sons and tests, 50-54, 57 
Variance of a difference, 40, 45, 199, 216, 
425 
Variance of a sum, 70, 75-76, 425 
Variance of an item, 425 
Variance of error scores, effect of dou- 
bling length on, 61-62, 67 
effect of test length on (general case), 
72-73 
equation for, 15-17, 26, 33-34, 37 
relation to error of estimate, 49, 57 
relation to error of substitution, 48-49 
relation to interaction, 50-54, 57 
relation to true variance, 8-11, 26, 34-35, 
37 
Variance of test, effect of doubling length 
on, 60-61, 67 
effect of multivariate selection on, 166, 
168, 169, 171 
effect of test length on (general case), 
69-71, 73 
effect of time limits on, 232-233, 241- 
242 
effect of univariate selection on, 110, 
124, 135, 138, 142, 148, 151, 156 
effect on reliability, 110-114, 124 
effect on validity, 128-143 
equations for, 424 
item parameters related to, 375-378, 
389 
relative effect of explicit and incidental 
selection on, 138-140, 142 
statistical criteria for equality in paral- 
lel tests, 175, 177, 185, 186-187 
Variance of true scores, effect of doubling 
length on, 61, 62, 67 
effect of test length on (general case), 
71, 73 
equation for, 14-15, 26, 30-32, 37 
relation to error variance, 8-11, 26, 
34-35, 37 
relation to variance due to persons, 54-57 


Weighted composites, correlation between 
(general case), 316-321, 355 

correlation between maximized, 348- 
351, 358 

correlation for random positive weights, 
321-327, 356 

derivation of correlation for random 
positive weights, 321-326 

derivation of correlation (general case), 
316-319 

derivation of weights for maximum 
correlation, 348-349 

formula for correlation between (gen- 
eral case), 319, 355 

formula for correlation, random posi- 
tive weights, 326, 356 

formula for subtest effect on, 339-340, 
357 

multiple cutting score compared with, 
312-314 

reliability maximized, 346-348, 349- 
350, 357 

subtest effect on, 338-341, 357 

subtest effect on, illustrated from basic 
engineering school data, 340-341 

to predict criterion, 327-330 


Weighting, by use of indifference function, 254 
of rights and wrongs, 255-256 
of time and error scores, 254 


Weighting coefficients, determined by expert 
judgment, 254-255, 341-342, 357 

determined from errors of measure- 
ment, 336, 357 

determined from number of items, 
336-338, 356 

determined from perfect scores, 336- 
338, 356 

determined from reliability coefficients, 
331-334, 356 

determined from standard deviations, 
334-336, 356 

determined from test means, 336-338, 
356 

general considerations, 314-315, 355 
multiple (see Multiple regression weights) 
multiple cutting scores compared with, 
312-314 
random positive, 321-327, 356 
variables characterizing, 314-315, 319- 
321, 326, 355-356 
Weighting of item alternatives, to maxi- 
mize item-criterion correlation, 
257, 261 
Weighting of items, to maximize criterion 
correlation, 327-330 
Weighting of scores, factor analysis used 
for, 343-345, 358-359 
to equalize marginal contribution to 
variance, 345-346, 357 
to give common factor, 344-345, 358-359 
to give first centroid axis, 344, 359 
to give first principal axis, 343-344, 358 
to maximize reliability, 346-348, 349- 
350, 357 
to maximize validity (see Multiple cor- 
relation) 
to maximize variance of composite, 
343-344, 359 
to minimize generalized variance, 343- 
344, 359 
to minimize intra-individual variance, 
343-344, 359 
Weighting of scores and criteria for maxi- 
mum intercorrelation, 348-351, 358 